OK, we're going to get started. Thanks, everyone. We don't have a very large crowd, but thanks for coming to the talk. Let me quickly introduce ourselves. I'm Wenbo, and this is my colleague. We're software engineers at Google Cloud. We work on the open source project gRPC, and we also support Google's internal infrastructure that powers the GCP services, which we sometimes call the cloud APIs, as well as Google's non-cloud external-facing services like Gmail or Search. Today we're going to talk about a specific case study: we'd like to share some experience about how to harden the data path over the hybrid cloud.

First, what do I really mean by hybrid cloud? I don't have an exact definition. The use case we're talking about here is a large-scale online application that moves its database layer, its storage layer, into the public cloud, while the majority of the application still stays in the on-premise data center. This is a shifting paradigm. The picture on the left is a traditional on-premise deployment where you have applications talking to a database, for example MySQL. The network links are usually reliable, and more importantly, they share the same fate: when something goes wrong, typically the database itself goes wrong first, or together with the network. But once you move the storage layer, the database layer, to the public cloud, the entire data path is different. Part of the reason for moving the database to the public cloud is to leverage the scalability and reliability of the managed service, but the data path suddenly becomes longer, and the failure modes are different, which we'll talk about. Now, that doesn't mean hybrid cloud is impossible to do right, or inherently unreliable. The key problem we'd like to address is what happens when the application layer is not adapted to this new kind of data path, the longer RTT, the different failure modes. That's the use case we're looking at today.

When we looked at the whole problem, we realized that the underlying technical problem is pretty generic. It's not really application-specific; it touches the core fundamentals of distributed computing. So instead of talking about the specific applications we've been working with, we thought we'd make this more useful and try to generalize it, while still keeping a concrete workload so it doesn't become too abstract. Here's what we came up with. There are a few properties of this kind of use case. First, in the hybrid-cloud case, the round-trip time from the on-premise client to the public cloud is larger than the service time itself. If you look at the diagram, you're looking at single-digit milliseconds, or tens of milliseconds, of network round-trip time, while the actual service, in this case Google Cloud Spanner, is in the single-digit milliseconds. And between the client and the actual service there is also a Layer 7 load balancer. So that's the RTT side of the picture.
The second property is that for this particular case study there are two types of RPCs. One type is transactional; here we simply refer to these as writes, although they aren't all pure writes, they can also involve reads. The other type is non-transactional, which we simply refer to as reads. When a read times out, the underlying software can retry, but eventually there is a deadline, and when the deadline is exceeded, some kind of error goes back to the user-facing layers. The end user may actually see it, and in some cases that can cause business impact; for example, the user may feel the application is not responsive. The same goes for writes: if a write fails, the whole transaction may get aborted. Imagine you're doing online ordering; the order gets canceled, you have to redo it, and the user may just walk away. The last property is that there is affinity based on connections. All the requests generated by the client over a single TCP connection are sent to the same server process, because that's where some level of session state is managed, or caching, especially metadata caching. So you can imagine you have all these different sessions mapped to different connections, and all the requests within a single session are routed to the same server process. We think this represents a lot of workloads; it's a fairly generalized version of them.

Now I'm going to talk about the things that are specific to the Layer 3 network in the hybrid-cloud deployment. First, in the hybrid cloud there is this concept of a flow. A flow is a networking notion; you can think of it as a kind of fixed IP route. Each flow is fixed: it goes through a fixed set of software and hardware components and links, like physical fiber links. TCP connections, based on the source IP and port, and especially with IPv6, get mapped to a single flow. A flow is also unidirectional, which means that when an outage happens and only a small fraction of flows are affected, typically only one direction of the traffic is affected, not necessarily both. At the same time, we notice that most of the failures look like a black hole: the whole flow basically goes down, as opposed to the partial packet loss you see on the internet. Overall, these Layer 3 networks are much more reliable and secure than internet traffic, but periodically, because of all the different components involved, you do see these kinds of failure patterns. On top of that, on the server side, the proxy and the actual service process can also get overloaded, and that's where the so-called tail latency comes from; in most cases it is not correlated with the Layer 3 failures. One thing I didn't mention on the slide is that this is mostly a latency-sensitive workload, not so much about throughput, so we're not really talking about congestion control or flow control here.
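A minimal sketch of the connection-affinity idea described above, in Go: sessions hash onto long-lived connections, so every request in a session lands on the same server process that holds its session state and metadata cache. ConnPool, ConnFor, and the FNV hashing scheme are illustrative assumptions, not the actual client implementation.

```go
// Illustrative sketch only: a toy model of session-to-connection affinity.
package main

import (
	"fmt"
	"hash/fnv"
)

// ConnPool models a fixed set of long-lived TCP connections to the service.
type ConnPool struct {
	conns []string // stand-ins for real connections
}

// ConnFor maps a session ID to one connection; every request in that session
// reuses the same connection and therefore reaches the same server process.
func (p *ConnPool) ConnFor(sessionID string) string {
	h := fnv.New32a()
	h.Write([]byte(sessionID))
	return p.conns[h.Sum32()%uint32(len(p.conns))]
}

func main() {
	pool := &ConnPool{conns: []string{"conn-0", "conn-1", "conn-2"}}
	for _, s := range []string{"session-a", "session-b", "session-a"} {
		fmt.Printf("%s -> %s\n", s, pool.ConnFor(s))
	}
}
```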
With all this in place, in this particular use case, the problem we're facing comes from the applications, the customers of our public cloud. The first question, from the networking team, is always like this: when there's a network outage and I see 1% packet loss, why is the RPC error rate so much higher? That's the first kind of question. The second comes from the application developers who are launching the application and accessing the cloud databases. Their question is typically: why do I need to know about any of this? Can you just recover everything for me, so that my application, my middleware layer, never sees any errors? Because errors mean business impact, and that's bad.

When we heard those questions, my immediate answer to the first one was: we don't know. They're not supposed to be directly related; how packet loss maps to RPC errors is obviously a different thing. But the follow-up question is: to what extent? Can you quantify that? For the second question, we initially thought, obviously we can do something; if we know there are errors, we can try to recover. But when you think more deeply, you realize it's a pretty challenging problem. If you react to failures too aggressively, you may introduce cost at runtime when there are no failures, through false alarms, and in the worst case you can very easily cause cascading failures. But if you don't react, then by the time you do, the failures are gone because they're short-lived, and it's already too late: the RPC errors have already propagated to the users and already had impact. When I was looking at that, the immediate picture I had for this talk was that it is just like a cage fight. The margin of error is so small, and there are all kinds of constraints and trade-offs you have to deal with. Now I'm going off topic a little bit. I thought it would be nice to put a picture here; I was thinking about the cage fight everyone was talking about six months ago, which never happened. I did a Google search and there are all these nice pictures, but before posting one I realized I don't really know the copyright situation, so I'll just let you picture it yourselves; maybe do a Google image search. Sorry for the tangent. Who doesn't know the cage fight I'm talking about? Okay. Anyway, mentally that's what I want you to picture; that's what this problem is like.

So what we did is look into the problem and decide that the most useful thing is to create a benchmark framework to quantify and provide answers, or at least some insights, for both questions. I'm going to talk about the setup for this particular benchmark, and then my colleague will show you the results we're getting. In this benchmark, the client is running on GCE (Google Compute Engine) VMs, and the failures are injected.
We didn't use a physical hybrid-cloud link, because most of the software components are the same between an on-premise VM and a GCE VM, all the way up to the so-called Layer 7 proxy, which is the boundary of the user-facing part of the infrastructure. For the reads, which simply use GETs, the response is large but the request is very tiny; for the writes we use POSTs, which are the other way around. We use 80% reads combined with 20% writes. All the requests are generated as a so-called Poisson stream, which means requests are generated asynchronously, without waiting for the previous response, and the interval between every two requests follows an exponential distribution. The useful property is that if you spread a Poisson stream across multiple clients or multiple TCP connections, as long as the total requests-per-second rate is the same, the requests arrive at the server in the same pattern; the server can't tell the difference. This is a very useful way to do benchmarking, if you're curious, because it lets you generate enough concurrency on the network and on the server without necessarily overloading either at the same time. That's how the workload is generated. And then, like I said, the packet loss is injected on a single connection at runtime, repeatedly, and that's how we get the data. Now I'm going to hand over to Vinod to talk about the results, especially to answer the first question: how do you quantify the relationship between the packet loss rate and the RPC timeouts?

Yeah, hello? So Wenbo asked me to collect a lot of data, and I would like to present some of it to you, which I thought was interesting; there are some obvious and some not-so-obvious conclusions. The first slide is just there to show what kind of network we are dealing with. This is a probability distribution of the round-trip time measured in the kernel; we use eBPF to measure the network latency between the two hosts. It's pretty stable at around 1.2 to 1.4 milliseconds, and the tail is also not that long, at 1.9 milliseconds. But when you look at the end-to-end RPC latency at different percentages of injected loss, you can see that the tail starts increasing rapidly, and since this is a very latency-sensitive application, at 5% packet loss the P99 latency is already more than five times what it is at 0% loss. So for our particular application, which is a latency-sensitive database workload, things go bad even before 10% partial packet loss. These graphs also show what Wenbo mentioned earlier: the network latency tails are not correlated with the RPC latency tails. I also want to explain how we came up with the percentage of RPC timeouts. We ran a lot of simulations, at different QPS and different durations, so we had to normalize across all these parameters. The percentage of RPC timeouts is the ratio of all the RPCs that failed over the entire run, divided by the RPCs that were started during the induced failure periods. Similar to the previous slide, we can see that the network, for our purposes, stops being useful or gets very bad around 10% loss, and there is very little difference between 20% packet loss and a complete black hole.
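A minimal sketch, in Go, of the open-loop Poisson-stream load generator Wenbo described above: requests are fired asynchronously, and the gap between consecutive sends is drawn from an exponential distribution. issueRequest and targetQPS are hypothetical stand-ins, not the actual benchmark code.

```go
// Open-loop ("Poisson stream") workload generator sketch.
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// issueRequest stands in for a real read (GET-like) or write (POST-like) RPC;
// here it just simulates a response delay.
func issueRequest(id int) {
	time.Sleep(2 * time.Millisecond)
	fmt.Println("request", id, "done")
}

func main() {
	const targetQPS = 100.0
	var wg sync.WaitGroup
	for i := 0; i < 20; i++ {
		wg.Add(1)
		go func(id int) { // fire asynchronously: don't wait for the previous response
			defer wg.Done()
			issueRequest(id)
		}(i)
		// Exponentially distributed inter-arrival gap with mean 1/targetQPS
		// seconds, which makes the send times a Poisson arrival process.
		gap := time.Duration(rand.ExpFloat64() / targetQPS * float64(time.Second))
		time.Sleep(gap)
	}
	wg.Wait()
}
```

Because the superposition of independent Poisson streams is again Poisson, spreading this generator over several clients or connections at the same total rate gives the server the same arrival pattern, as noted in the talk.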
Now I want you to take a mental picture of this chart, because I'm going to compare it to the next slide, where we change the transport from HTTP/2 to HTTP/3 over QUIC. We see a huge difference at the various packet-loss percentages. This is because of head-of-line blocking: TCP forces an order in which all the packets must be received, which means one RPC will block the next RPC, and you see that effect here. If you use HTTP/3 over QUIC, then even on lossy, high-RTT networks you can still get decent performance, which is why QUIC should be used on such connections, like over the internet or on very long-RTT paths. To drive the point home, this is a graph where we varied the requests per second for HTTP/2. You can see, specifically in the 10% loss case, that once the rate goes above 20 requests per second, you actually see two RPCs contending with each other on the same connection, and there is a huge spike after 20 requests per second. That is head-of-line blocking. If you do the same for QUIC, it is much, much better. Just to make sure it was a fair comparison: the black-hole case, where 100% packet loss was induced, is pretty much the same across all the scenarios. So yeah, I would like to ask Wenbo to talk more about failure detection and recovery. Thanks, Wenbo.

Yeah, I just want to add a quick note about why we looked at QUIC in this case. We still think that TCP with HTTP/2 is the transport, and gRPC, which is based on HTTP/2, is the transport for data-center communication, even for hybrid cloud. Using QUIC for data-center communication is, I think, still at a pretty early stage. We do have some initial experiments, but broadly speaking, QUIC introduces overhead, and a lot of the infrastructure today is built to support TCP. So we don't want the takeaway to be: if I have this kind of workload, I should use QUIC because there will be packet loss. Two reasons. First, most of the packet loss we actually experience is total packet loss, a black hole; in that case, as you saw, QUIC doesn't help. Second, you want a connection pool anyway, and not just because of head-of-line blocking; the reason you want a connection pool is really for load distribution, for load balancing, not to make the connection more tolerant to this kind of failure, which only happens once every several weeks or months. I just wanted to clarify that. But we thought the results would be very useful, because you can immediately see the role multiplexing plays in the TCP case when there is partial packet loss, and how the RPCs get affected.

The next part is about how we plan to address the second question: what can you do about this kind of failure at Layer 7, and can you actually recover from it completely? For benchmarking, we came up with what I would consider a pretty intuitive recovery scheme, although it does involve multiple parameters, so I'll take a little time to explain. There are really two parameters you can tune to detect failures on the client side, without knowing any of the details on the server side or the infrastructure side.
The first is how many acks you are waiting on that have timed out. By ack, we simply mean any response data that is time-bound and associated with a request: it can be headers, some part of the response bytes, or the end of the response, anything that is time-bound. We just refer to all of these as some form of ack. As the client sends multiple requests asynchronously, those requests are in flight and you're waiting for some form of response to come back. The total number of those pending acks is the first parameter; in this case we call it B. Then you have the timeout: how long do I wait for each individual ack before I decide this ack has timed out? The particular algorithm we came up with takes into account that sometimes the connection sees a lot of traffic, and sometimes it's mostly idle but still has enough traffic. If it's totally idle, you don't care; you just let it run. So we make both parameters dynamic: it can be one ack that has timed out, or two, or three, and the total timeout is adjusted based on the number of acks. The more acks that are in flight and timed out at the same time, the more confidence you have that a real failure is happening and you need to react. So, if you do a little math: if three or four acks have timed out, I'll wait four RTTs; otherwise, if there's only one pending ack, I'll go all the way to 10 RTTs. Also, as you can see, because all the requests are generated asynchronously, and the server is multi-threaded and processes them asynchronously, the ordering can get mixed up. TCP does preserve FIFO ordering on the wire, but the order in which requests appear on the connection may not match the order of the acks; they don't match each other.

So now the question is: with this recovery algorithm, what metrics can we use to evaluate its effectiveness at runtime? We came up with two. The first is called mean time to failure detection; I don't know if that's a standard term, we sort of made it up. It measures, when a failure is actually injected, how long it takes for the algorithm to detect that it's a real failure; the shorter, the better. The second metric is basically more important: false positives. If I run this algorithm on a connection without any failures injected, will it falsely detect a failure? For the second metric, we chose to normalize it: a value of 1.0 means that over the whole duration of the benchmark run, which is many hours, 20 hours or even longer, the algorithm never detected any failure, which is great, and what you want. A value of 0.2 means that over the whole span of the benchmark there were around four or five false-positive cases. We didn't run this detection directly at runtime; we used simulation instead, because once you have all the samples, you can just replay them. We used something basically similar to discrete-event simulation, so we can tune the different parameters and get results in just a few seconds, whereas the full sample set we feed to the algorithm is hours of data, and remember, our RTT is very small, so we're talking about potentially millions of samples.
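A rough sketch, in Go, of the kind of dynamic ack-timeout rule described above. The 4-RTT and 10-RTT figures are the example numbers from the talk; the two-ack value is an interpolated assumption, and ackTimeout and connectionFailed are illustrative names, not the actual implementation.

```go
// Sketch of a dynamic per-ack timeout and an aggregate failure-detection rule.
package main

import (
	"fmt"
	"time"
)

const rtt = 1200 * time.Microsecond // assumed baseline RTT, per the measured distribution

// ackTimeout: with more in-flight acks, a tighter per-ack timeout is acceptable,
// because several simultaneous timeouts give higher confidence of a real failure.
func ackTimeout(pendingAcks int) time.Duration {
	switch {
	case pendingAcks >= 3:
		return 4 * rtt
	case pendingAcks == 2:
		return 7 * rtt // interpolated value, not from the talk
	default:
		return 10 * rtt
	}
}

// connectionFailed declares the connection failed once at least minTimedOutAcks
// pending acks have each exceeded their dynamic timeout.
func connectionFailed(ackAges []time.Duration, minTimedOutAcks int) bool {
	timedOut := 0
	for _, age := range ackAges {
		if age > ackTimeout(len(ackAges)) {
			timedOut++
		}
	}
	return timedOut >= minTimedOutAcks
}

func main() {
	ages := []time.Duration{6 * rtt, 5 * rtt, 9 * rtt}
	fmt.Println("connection failed:", connectionFailed(ages, 3))
}
```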
On the left side of the slide, we fix the number of acks: there must be three or more acks timed out at the same time before we declare the connection failed. Then we start to increase the total timeout, because our RTT is so low, scaling it by a factor of one, two, three. From the first bars you can see that it is possible to reach 1.0 if you fix the number of acks at three and gradually increase the timeout. At that point you can reach the ideal situation: there are no false positives, and the delay to detect the packet loss is still pretty reasonable, looking like a few hundred milliseconds. If you try to reduce the number of acks, especially to deal with the case where there isn't enough active traffic, you get some false positives, but at the same time it does allow you to detect real failures more quickly. On the right side we did it the other way around: we fix the total timeout and start to decrease the number of acks, to two acks or only one, and say, hey, if this particular ack times out, I'll declare that this connection has failed. You can see that the result is pretty sensitive to that: if you reduce the number of acks to two, there's a significant false-positive rate, and with one ack it changes again. The idea is that once you have the samples for this kind of workload, you can run the simulation, apply different algorithms, and evaluate how safe it is to go aggressive, or how effective it is at actually detecting the failures.

The next question is: once a connection has failed, what can you do? This is the interesting part; there are a lot of things you can do. In this particular case, given the workload we have, this is the strategy we came up with, and we implemented a more conservative version of it. I'm skipping a lot of details; you have to handle throttling and a bunch of other things. But overall, we decided to let in-flight requests run to completion, even on failed connections, and to fail over all the new requests to a separate connection. The most important decision is to decouple the failure-detection timeout from the actual read or write timeouts, which are application-specific requirements, deadlines coming from the upper layers. That matters because a lot of similar solutions typically track the actual request failures and then fail over; here we deliberately decoupled them so we can understand where the limits are for this whole problem. Lastly, I will say that if all the parameters work out nicely, the errors can potentially be more or less completely masked, if, for example, the read timeout is long enough and the write timeout is longer than the failure window. We find that typically the failures only last a few seconds, maybe ten, and that can potentially give you a zero-error outcome as far as the application is concerned.
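A minimal sketch, in Go, of this conservative failover idea: new requests are routed to a fresh connection once the detector flags the old one, while requests already in flight on the flagged connection are simply left to run to completion against their own application deadlines. The types and method names are hypothetical, not the actual client implementation.

```go
// Conservative connection-failover sketch.
package main

import (
	"fmt"
	"sync/atomic"
)

type conn struct{ name string }

// failoverPool holds the connection used for new requests. In-flight requests
// keep their original connection; only new requests see the replacement.
type failoverPool struct {
	active atomic.Pointer[conn]
}

// markFailed swaps in a fresh connection only if the flagged one is still
// active, so concurrent detections don't churn connections repeatedly.
func (p *failoverPool) markFailed(old *conn) {
	p.active.CompareAndSwap(old, &conn{name: old.name + "-replacement"})
}

func (p *failoverPool) pick() *conn { return p.active.Load() }

func main() {
	p := &failoverPool{}
	c0 := &conn{name: "conn-0"}
	p.active.Store(c0)

	fmt.Println("new request uses:", p.pick().name) // conn-0
	p.markFailed(c0)                                 // failure detector fires
	fmt.Println("new request uses:", p.pick().name) // conn-0-replacement
}
```

Note that the detection timeout driving markFailed is deliberately separate from the read/write deadlines the application sets, which is the decoupling described above.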
False positives, though, are something you will always have, and a real production environment tends to be even less stable. The cost in our case is that you may get duplicate reads, which are fine but add overhead, and writes may get misrouted because they're sent on a different connection and then have to be forwarded on the server side. With all that, we want to share what we think the key takeaway is. This is an example of the end-to-end argument: leverage the service-level semantics, handle things gracefully, adapt to runtime conditions, and do it at a high level, as opposed to trying to solve the problem at the lower transport layer, which typically doesn't have all the context. For this kind of extreme case, with very tight latency requirements, you have to manage all the trade-offs very carefully to reach the minimum error rate. Everything is relative, though. We also think Layer 3 recovery, if it can be done gracefully, quickly, and transparently, is still a good thing, because as you can imagine, our approach involves a lot of user-space code, and if recovery can be done at Layer 3, that's nice. Google published a paper at SIGCOMM this year on preemptive rerouting: when a particular flow is broken, all the packets of the TCP connection are rerouted onto a different flow, and everything just works, transparent to the applications. That's a good example: low-level recovery is still desirable if it can be done well.

I want to conclude this part. You could say: this is benchmarking, it all looks nice, you have all the data, but what happens in a production environment? Can I use this as a reference implementation or solution for an actual production workload? My answer is probably not. If I were asked the same question again after this, my answer would still be no, because this whole thing is data-driven. You need to know your data, you need to know your workload; you can't just take these benchmarking conclusions and apply them as-is. On the other hand, we do think this kind of quantitative analysis is useful in the sense that it gives you insight into how the different parameters play together. That's our goal here. Which brings us to another area we're looking into: if we all understand the value of the data, where do you get the data from? Looking at the kind of SaaS data path we've been discussing, we find that the big gap is in the two boxes highlighted here, basically the transport layer, in two ways. One is that it's usually very difficult to get data at that layer, especially with very lightweight, low-overhead monitoring. The second is the correlation between that layer, in user space in the guest, and the Layer 3 layers below, because on one side you're talking about services and RPCs, and on the other side you're talking about packets.
A packet may contain bytes from different RPCs, and an RPC may span multiple packets. So this is the monitoring gap, and we think it's really important: in a production environment, monitoring is the critical part. On that note, we'd also like to announce this project, which we have been working on for a while and recently published from Google. You can take a picture; the site is available, there's a link there. This is our first attempt to provide a lightweight, standalone solution that gives you insight into this transport layer, which we think is very important, especially for debugging and for dealing with failures. We'll continue working on the project, and hopefully it will become more useful for everyone. In terms of language support, it currently only supports Go, and we are adding Java and other languages. We also think that not only collecting the data, but processing and exploring it in a lightweight way, is critical, because the amount of data is going to be overwhelming, so we're looking at that problem as well. And we do think QUIC and HTTP/3 support is important, so we're looking into that too.

So, thanks everyone. We'd like your feedback, and even better, contributions, in three areas. One, take a look at the eBPF monitoring project if you think it's interesting; all the information is on GitHub. Two, the benchmarking framework: over time we will try to open-source it. We've put some of the client-side pieces out there; we haven't quite published the simulation part yet, but take a look, and if you have any feedback, reach out to us. And the last thing: if you have a particular use case in mind and you're dealing with the same problem, we'd like to hear from you. Even better, share your data sets with us and we can work with you on some simulations, to understand whether there's a way to create a more general-purpose solution for this kind of problem. The other thing is that, in dealing with this problem, we also don't know how widespread it really is, this whole hybrid-cloud problem. Maybe it's just one particular case we happened to run into and everyone else is doing fine, or maybe it's actually a pretty profound problem that everyone is dealing with; we just don't know. So yeah, we really want to hear from you. Okay, thanks everyone. We'll start taking questions.

I have a quick question. Would you mind describing the formula, the algorithm you introduced before? How did your team come up with that formula?

Yeah, this one. The question is how we came up with this so-called formula. We just kind of know that there are really only two fundamental signals: how many pending acks there are, and how long you want to wait. Then we try to make it more dynamic. Instead of having something static that says, wait 100 milliseconds for three acks and then time everything out, we try to be a little more adaptive, especially considering that sometimes the connection may see a lot of concurrent requests and sometimes not that many. So that's the algorithm we came up with.
And also, because we're doing this as benchmarking and simulation, we can easily change the parameters. In an actual implementation it may not go that far, because this does introduce a lot of variables. Okay, thank you.

Hi, it's very common in microservices to look at configuration specific to the microservices, like Java threads and so on, and people do research on that. I'm wondering whether these network knobs, if exposed, would also benefit microservice performance. Nobody talks about the network today; we just assume gRPC works, TCP works. But these are knobs that could possibly be tuned to make microservice performance better, especially in multi-cloud kinds of deployments. Yeah, thanks.

So the question is whether we can generalize this and expose it as more formal configuration knobs for microservices, for this kind of workload. We'll definitely look into that. I guess it goes back to the last point I made: we don't know how widespread this problem really is. Because, like I said, these mechanisms do solve the problem when there are actual failures, but they do introduce false positives, and there's a cost that comes with that as well. But we're definitely looking into it, thanks. The other thing I'll add, just to follow up on your question: part of the reason we're doing this is that it's passive monitoring, done in the data plane. For most workloads you don't really need to react in such a short window; you can rely on a centralized control plane, health tracking, those things, and usually that just works. This is more for the very low-latency case, where the RTT is very long compared to the latency requirements, and you're trying to squeeze out the last bit of performance and reliability. Okay, thanks everyone. Hopefully you enjoy the conference and the lunch.