Hello and welcome to "Reducing Client Latency and Timeouts When Running Cassandra on the Public Cloud." My name is German Eichberger. I'm a principal engineering manager for Cassandra on Azure. A little bit more about myself: I attended my first ApacheCon in 2005 on a community scholarship, and I set up my first Cassandra cluster for HP Connected in 2013. Now, as I said, I'm the principal engineering manager for all the Cassandra stuff. I was also nominated as a Cassandra Catalyst at this summit. That's really nice and I'm thankful for it, and I'm glad they have this program for people who are non-code committers.

Let's see what services we run in Azure. We're running Azure Managed Instance for Apache Cassandra, which is basically open source Cassandra; we offer 3.11, 4.0, 4.1, and a 5.0 alpha as a service, so you can just go and use it. The other thing we're running is a translation layer that translates CQL to Cosmos DB; that's the other service I'm responsible for. But we will focus here on the open source Cassandra service.

Let's talk about some definitions. Latency is the time it takes for data to pass from one point of the network to another. A latency budget is where you define how much latency you can tolerate: how much latency there is in your application, how much is acceptable, whether you measure it one way or round trip, and so on. The latency budget you have influences how much money and how much optimization you need to put in. If there are no tight latency constraints, it doesn't make sense to buy really expensive things; on the other hand, if you have a tight latency budget, you have to really, really optimize. Tail latency is the high percentiles, the P99s and such, which people measure. Tail latency can derail your latency budget, so you try to minimize it, and minimizing tail latencies is another thing that's expensive. Then timeouts: if the latency at the client exceeds a certain threshold (people often set something like a two-second timeout, or even less), you basically have a timeout and you have to retry or do something else.

Do we actually care about latency? Cloud infrastructure introduces tail latencies; I don't want to hide that. But other sources of latency, like slow queries, are usually bigger showstoppers. For data analytics workloads, most of our customers don't really care about having the best latency. On the other hand, if adding an item to a customer's shopping cart takes more than a second and the customer doesn't buy, that's an example where I would care about latency. There may be other hops in your app too, so you really have to go through the whole chain and figure out where you lose latency; people in e-commerce do exactly that. So those are the customers we have who are really latency-aware.

Timeouts and retries: as I said, timeouts are caused when the latency exceeds the threshold. But a timeout is just an error, and we just retry. Some clients tear down the session and do a new authentication, which gets you into a thundering herd problem. That's what I mean by adding load to an already strained node and causing more trouble. We've seen it in our cloud when people fail over from one region to another: suddenly you get thousands of authentications at once. So in most cases that doesn't make things better. Or we hit other random nodes and overload them. That's the double-edged sword: just retrying, in some cases, doesn't make things better.
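To make that concrete, here is a minimal sketch of the "don't reconnect, just retry" advice, assuming the DataStax Java driver 4.x (the contact point and data center name are placeholders): keep one long-lived session for the whole application and cap the request timeout, so slow requests fail fast without tearing the authenticated connection down.

```java
import java.net.InetSocketAddress;
import java.time.Duration;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
import com.datastax.oss.driver.api.core.config.DriverConfigLoader;

public final class SharedSession {
  // One long-lived session for the whole application. Recreating the session
  // on every timeout is what turns a retry storm into an authentication storm.
  private static final CqlSession SESSION = CqlSession.builder()
      .addContactPoint(new InetSocketAddress("10.0.0.4", 9042)) // placeholder
      .withLocalDatacenter("dc1")                               // placeholder
      .withConfigLoader(DriverConfigLoader.programmaticBuilder()
          // Fail fast at the client: a timeout is just an error, and the retry
          // should go to another node over the existing, authenticated pool.
          .withDuration(DefaultDriverOption.REQUEST_TIMEOUT, Duration.ofSeconds(2))
          .build())
      .build();

  private SharedSession() {}

  public static CqlSession get() {
    return SESSION;
  }
}
```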
One way to fix it, and I'll give you the answer up front, is speculative execution. What you want to do is reach out to more than one node, and whichever node answers first, that's the answer you take. That only works for queries which are marked idempotent, so it's very important to mark all your queries idempotent. If you've seen the Netflix talk, they have a proxy at the gateway in between which marks all queries idempotent, because they don't trust their app developers to do that correctly. There are definitely problems getting app developers to do it right, but assuming you get there, everything needs to be marked idempotent, and then the driver will reach out and ask more nodes, and whoever answers first wins. The assumption you're making here is that not all replicas are plagued by the same tail latency event, which is usually a good assumption, especially in the cloud, where we can spread things across availability zones and regions. We made some sample code showing how to use speculative execution (a sketch along those lines follows below); you can definitely use it, and it's not necessarily Azure specific, so you can use it in other applications too.

Speculative execution is basically trading predictability against load. Because you're reaching out to multiple nodes at the same time, you increase the load on your cluster; but because you assume that not all of them are plagued by the same tail latency event, on average they will answer faster. So you're basically trading more load for the predictability that your latency stays between defined values. That's what this slide was about: if we reach out to several replicas at the same time, we increase the load, but our latency is pretty predictable, and that's what we want. As I said, some people would rather spend more money to get the latency down and more predictable than the other way around.
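Here's a minimal sketch of that setup, assuming the DataStax Java driver 4.x (the keyspace, table, and values are made up): a constant speculative execution policy plus a statement that is explicitly marked idempotent, since the driver only speculates on idempotent statements.

```java
import java.time.Duration;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
import com.datastax.oss.driver.api.core.config.DriverConfigLoader;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class SpeculativeReads {
  public static void main(String[] args) {
    try (CqlSession session = CqlSession.builder()
        .withConfigLoader(DriverConfigLoader.programmaticBuilder()
            // If the first node hasn't answered after 100 ms, ask a second one;
            // whichever replies first wins.
            .withString(DefaultDriverOption.SPECULATIVE_EXECUTION_POLICY_CLASS,
                "ConstantSpeculativeExecutionPolicy")
            .withInt(DefaultDriverOption.SPECULATIVE_EXECUTION_MAX, 2)
            .withDuration(DefaultDriverOption.SPECULATIVE_EXECUTION_DELAY,
                Duration.ofMillis(100))
            .build())
        .build()) {

      // The driver refuses to speculate on statements not marked idempotent,
      // so this flag is what actually enables the feature per query.
      SimpleStatement read = SimpleStatement
          .builder("SELECT item FROM shop.cart WHERE id = ?")
          .addPositionalValue("cart-42")
          .setIdempotence(true)
          .build();
      session.execute(read)
          .forEach(row -> System.out.println(row.getString("item")));
    }
  }
}
```

With two executions in flight, these queries can put up to twice the read load on the cluster; that is exactly the predictability-for-load trade described above.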
Let's go to the next slide. So after we've talked about a solution, let's talk about where all those tail latencies come from. There's load, where a node is down or unavailable. There are garbage collection pauses which stop the world; there are ways around that, but assuming you have a stop-the-world garbage collector, it will cause tail latencies. There's VM maintenance, where somebody takes the VM down to do some maintenance, like a host update or a hardware fix. There's disk latency, and that's a big one; there are even studies showing that SSDs do their own internal garbage collection, and if you don't have really good SSDs, they will be plagued by latency events. And then the all-time favorite and source of a lot of trouble: network latency.

Let's go to the next one. A common low-latency setup we've seen with our customers is LOCAL_ONE reads, because they want the fastest reads possible, combined with quorum writes, LOCAL_QUORUM to be precise, because they don't want to lose data but also don't want the latency of cross-DC round trips. So what does a LOCAL_ONE read look like? Normally, without speculative execution, the client does a LOCAL_ONE read: it picks any one of the replicas which has the key, that replica fetches the value from disk (or maybe it's in memory), and returns it to the client.

The one thing you need to know is how maintenance works in Azure; it's probably similar in other clouds. Most maintenance is done either while the VM is frozen or by live-migrating it to a new host. If it's maintenance where we need to make hardware changes, we freeze the VM and migrate it to a new host. That takes a couple of seconds, at most around 30 seconds, and there might be degraded performance leading up to it, because freezing and saving state needs CPU; when the VM comes back, it might also take some time before the network and everything is back to normal. So it might be a few minutes of degraded performance and maybe eight to 30 seconds where the VM is really stopped. In Azure we don't do this everywhere at the same time; we usually roll through by availability zone. That's not guaranteed, because if there's a disaster or something we have to be quicker, but I've never seen that; it has always been by availability zone, and rolling. Some maintenance requires a reboot, and we allow you to delay reboots; some maintenance can be delayed, other maintenance can't. It's always good to build your application with these events in mind. It's not like you can get rid of them; they will come and get you one way or the other. We watch that space, and I don't think we're any worse than the other clouds, or that they're any better than we are.

Okay, so let's see how that plays out when we have freezes. As I said, that happens from time to time: you do your LOCAL_ONE read and the VM freezes, usually for about eight seconds. We've looked at this quite extensively. There's a little extra latency leading up to the freeze, and then the node is unavailable for a few seconds, on the order of ten seconds, and it won't get marked down in most cases; the other nodes just see it as slow, because it's only eight seconds. When you look at the timeouts, most people don't put eight-second timeouts on their internode communication; it's more like 30 seconds or even minutes, so Cassandra doesn't mark the node down. And anyway, when you're in the cloud you probably shouldn't be too tight with those thresholds, because it's the cloud. But the problem is that this client now has to wait eight seconds until maybe the answer comes back, so it's probably 10 or 15 seconds until the whole thing settles. You have a really, really bad tail latency event in that case.

And what happens when you look at the writes? We do LOCAL_QUORUM, which means at least two of the three replicas need to reply. In this case one replica is frozen, so it won't reply, but the other two still can, so the write succeeds. Because the frozen node doesn't acknowledge writes for a while, it might get marked down in this case, and then hints get stored for it; we have seen that happen. And then there's a race condition with reads: since the write never reached the frozen node, it comes back up still holding the stale value. If you do a LOCAL_ONE read against it, you have inconsistent data until the hints are caught up. I'm telling you, it's not ideal. And there might be repairs afterwards, which then make the reads slower. I don't want to sugar-coat it; there can be these knock-on effects. Okay.
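To illustrate the trade-off just described, here's a hedged sketch of that read and write setup with the Java driver (the table, columns, and DAO shape are invented for the example). Note the comment on the read path: with LOCAL_ONE you accept possibly stale reads until hints are replayed.

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
import com.datastax.oss.driver.api.core.cql.Row;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class CartDao {
  private final CqlSession session;

  public CartDao(CqlSession session) {
    this.session = session;
  }

  // LOCAL_QUORUM write: two of three local replicas must acknowledge, so one
  // frozen VM doesn't fail the write, and we don't pay cross-DC latency.
  public void addItem(String cartId, String item) {
    session.execute(SimpleStatement
        .builder("INSERT INTO shop.cart (id, item) VALUES (?, ?)")
        .addPositionalValues(cartId, item)
        .setConsistencyLevel(DefaultConsistencyLevel.LOCAL_QUORUM)
        .setIdempotence(true)
        .build());
  }

  // LOCAL_ONE read: fastest option, but a replica that missed a write can
  // serve a stale value until hints are replayed. If you need read-your-writes,
  // read at LOCAL_QUORUM too, so that R + W > RF.
  public String getItem(String cartId) {
    Row row = session.execute(SimpleStatement
        .builder("SELECT item FROM shop.cart WHERE id = ?")
        .addPositionalValues(cartId)
        .setConsistencyLevel(DefaultConsistencyLevel.LOCAL_ONE)
        .setIdempotence(true)
        .build())
        .one();
    return row == null ? null : row.getString("item");
  }
}
```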
So what can be done on the back end? You might say, well, this is not great, but I want to do something about it. What can we do? We can fail fast: you can take the node out of rotation by switching off gossip and the native transport. We have that in our Azure service; you can say, please take it out. We get notified about upcoming events (more on Scheduled Events in a moment), and we can take the node out. In that case, for a short period of time you don't have all your nodes available, but you also don't have the tail latency. Another option, when we learn about something, would be to evacuate the host or redeploy the VM. We don't do that, since redeployment takes really, really long, and you don't want a five-minute operation for an eight-second event; that doesn't make sense. You also don't really know up front: an eight-second event could turn out to be longer. As I said, it's really difficult to handle all of this on the back end, because you don't really know. The best thing would be to get your app developers on board and have them use speculative execution, because there are many more error modes that back-end mitigations don't address but speculative execution would cover. So again: maybe consider speculative execution.

As I said: Azure gives you something called Scheduled Events, which we listen to; it tells you when an event starts and stops and how long the duration is (a minimal sketch of polling it follows below). If you go by that, the whole cycle takes at least a minute, because they tell you ahead of time with a little delay (they're conservative), and it takes a while until they tell you it's over; all in all I'd say about two minutes. We implemented this: we listen to those events, and if you turn on a setting, we will take the node out of rotation, or multiple nodes if they're in the same data center, to avoid the latency. But we're not convinced that's perfect. Anyway, here's an example: we take a planned maintenance event, some redeploy or whatever they're doing, we see when it's happening, and before Azure stops the VM we stop the Cassandra daemon, which flushes to disk. If it's a reboot, it might be worth doing a full drain, but the node gets drained anyway when the daemon is stopped. As I said, it's a lot of development effort, and it's mostly done so as not to inconvenience your app team.
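Here is a minimal sketch of such a listener, assuming the documented Azure Instance Metadata Service endpoint for Scheduled Events (the polling interval and the reaction are illustrative; a real implementation would parse the JSON and drive nodetool or a management API):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class ScheduledEventsWatcher {
  // Azure Instance Metadata Service; only reachable from inside the VM.
  private static final URI ENDPOINT = URI.create(
      "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01");

  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder(ENDPOINT)
        .header("Metadata", "true") // required header for IMDS
        .timeout(Duration.ofSeconds(5))
        .GET()
        .build();

    while (true) {
      String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
      // A real implementation would parse the JSON: on a Freeze/Reboot/Redeploy
      // event for this node, disable gossip and the native transport before
      // the NotBefore time, and re-enable them once the event is over.
      if (body.contains("\"EventType\"")) {
        System.out.println("Upcoming maintenance: " + body);
      }
      Thread.sleep(10_000); // poll on the order of seconds
    }
  }
}
```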
Let's talk about another thing we have encountered. Our standard setup is to use network disks. I know people use local disks too, but our standard configuration is network disks, because they're more durable. So we have a LOCAL_ONE read; it goes to Cassandra, Cassandra goes to the network disk, and for some reason there's a 20-second tail latency. That sometimes happens with network disks. Then the requests and reads pile up; you can see it in the read stage, where pending reads keep growing. That increases memory pressure, because more and more requests pile up, and it potentially triggers memtable flushes, which then also don't complete, because of the same tail latency. And as I said, because of that tail latency you also have the quorum writes hitting that node. The write requests will still be successful if the other nodes are all happy; the slow node gets marked down and hints get stored for it, and because of race conditions you might again get a stale value or no value.

This disk tail latency has been a pain for us. It's easy to detect after the fact: once it has happened, we know it was a disk tail latency. If you see it on-prem, you might want to buy new disks or something. In the cloud we also have a storage team, actually quite a large one, which looks at those things and tries to work on them, and we can try to second-guess that team. The only mitigation you have in that case is to second-guess them and move your Cassandra node to a different host, in the hope that the disks are better there. That won't necessarily help the latency, but it might restore the throughput.

The other thing we're now actively exploring is whether maybe we're doing it wrong; that's always an option, always worth asking. We use P30 disks, one disk type. I've talked to people, and most people on Amazon use the ephemeral local SSDs, so that's the other model. What we implemented recently, to step around the read latency, is an ephemeral local SSD as a write-through cache. Basically, how it works is: you write to both the local disk and the network disks, but you predominantly read from the local disk, so reads aren't affected by any tail latencies on the network disks. And if you lose the ephemeral disk, and Azure tells you that you might lose it, then we rebuild it from the remote disk; I'll sketch the pattern below. We got inspired by Discord here; here's a link to their write-up. There's also another technology at the software RAID level, developed because SSDs have garbage collection, with garbage collection events where they become unresponsive: it puts a latency detector into the software RAID layer. We liked that one a lot too, but we went with the first approach. If you build the other one, let us know.
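Conceptually, the cache behaves like the sketch below: writes are mirrored to both disks, reads prefer the ephemeral SSD, and a lost ephemeral copy is rebuilt from the remote disk on demand. This is only an illustration of the pattern in plain Java; the real implementation lives in the storage layer underneath Cassandra, not in application code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class WriteThroughCache {
  private final Path localSsd;    // ephemeral local SSD: fast, can disappear
  private final Path networkDisk; // remote disk: durable, occasional tail latency

  public WriteThroughCache(Path localSsd, Path networkDisk) {
    this.localSsd = localSsd;
    this.networkDisk = networkDisk;
  }

  // Writes go to both copies, so the durable copy is never behind.
  public void write(String name, byte[] data) throws IOException {
    Files.write(networkDisk.resolve(name), data);
    Files.write(localSsd.resolve(name), data);
  }

  // Reads are served from the local SSD, dodging network-disk tail latency.
  // If the ephemeral disk was lost (for example, the VM moved hosts), fall
  // back to the remote disk and repopulate the local copy.
  public byte[] read(String name) throws IOException {
    Path local = localSsd.resolve(name);
    if (Files.exists(local)) {
      return Files.readAllBytes(local);
    }
    byte[] data = Files.readAllBytes(networkDisk.resolve(name));
    Files.write(local, data); // rebuild the cache entry
    return data;
  }
}
```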
Okay, so a little bit about network latency. That's something that is a little trickier to deal with, because in our case a lot of it is intermittent. So we have a client, it talks to Cassandra, and something in between adds, say, a five-second tail latency, beyond my timeout, because client timeouts are often set at two seconds. If you have a retry policy, it will try a different node; and if you do what I told you, it wouldn't matter at all, thanks to speculative execution. The other thing that makes the network tricky is that in cloud networks there are redundant routes. Some packets go one route, some go another, and one of those routes might be bad; then the retry takes a different route, so it's very intermittent. We're computer people: we like it when something is either broken or not broken. We really don't like this in-between state where some packets are fast and some are slow. It's no good. The writes are similar: the latency might be between the nodes, you can have that same five-second event, and the write still goes through, but the slow node gets marked down and doesn't have the data.

An extreme example is what we call a network split, where the network link between regions is broken; then you get split brain and such, which is well documented. But here we only deal with the latency events. So what do we monitor? We monitor TCP retransmissions. We monitor errors. We monitor our coordinator rates. And we also have active probes: basically, at a fixed interval, every node tries to contact all the other nodes in the network and reports back whether that worked. In our practice, the active probes are the most effective at letting us know about network issues. The other signals are noisier; you see a little bit there, but it's hard to figure out what's going on.
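Our probes aren't public, but the idea is simple enough to sketch with the Java driver: at a fixed interval, run a trivial query pinned to each node the driver knows about and report the round-trip time. The query, interval, and timeout here are arbitrary choices, and a real deployment runs this mesh-style from every node, not from one client.

```java
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;
import com.datastax.oss.driver.api.core.metadata.Node;

public class ActiveProber {
  public static void start(CqlSession session) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(() -> {
      // Probe every node with a trivial query routed to that node only.
      for (Node node : session.getMetadata().getNodes().values()) {
        long start = System.nanoTime();
        try {
          session.execute(SimpleStatement
              .builder("SELECT release_version FROM system.local")
              .setNode(node) // pin the request to this node
              .setTimeout(Duration.ofSeconds(2))
              .setIdempotence(true)
              .build());
          long micros = (System.nanoTime() - start) / 1_000;
          System.out.printf("probe %s ok in %d us%n", node.getEndPoint(), micros);
        } catch (Exception e) {
          System.out.printf("probe %s FAILED: %s%n", node.getEndPoint(), e);
        }
      }
    }, 0, 30, TimeUnit.SECONDS);
  }
}
```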
As I said, most clouds have redundant routes, and so do we in Azure, which prevents complete loss but can give you intermittent errors. And if you go to a networking team, they only measure up to the VM. If they need to investigate something and the problem is inside the VM, they won't know: say you run out of ports or something crazy, or a guest OS update messes things up. Say you lose accelerated networking (SR-IOV): the VM loses it and things get slower, but the networking team won't see any of that; they'll just see that everything is good up to the host. When that happens, you don't know what's going on, and it's often best to redeploy in those cases. For network problems there's a lot of automation that fixes things; if a problem persists, it takes a while, and then it gets fixed. I consider network problems the most difficult to diagnose and remediate. That's the holy grail.

Anyway, what's the summary here? We talked a little about latency mitigation: what you can do to get latency down when you basically can't get the app team to do speculative execution. Everything there is pretty difficult to do yourself. We did a couple of things in our team: we can take nodes out of rotation, we automated that, and it can help with latency. We built a write-through cache to basically get the read path off the network disk. And we have the active probes deployed, so we can tell you when there's a network issue. But in general, especially on the VM side, it's eight seconds we're talking about, and for most people that should be tolerable. All these mitigations are very difficult with little payoff. We did them anyway, because that's who we are, and they're a good jumping-off point for failure detection; we have that too, failure detection, and mitigations where we can move you to different hosts and all those things. But yes, it all takes time.

So what's the best way to achieve low latency? You need to plan for spare capacity; you need to have some headroom so you can compensate when there's latency. We really, really recommend using speculative execution, and rewriting the application so that everything can be idempotent. If something on your high-performance path can't be idempotent, then you need to re-architect: rethink the data model and see if you can make everything idempotent. It's at least very rare that you truly can't re-run the same operation without changing state.

You need to monitor throughput and you need to monitor latency; definitely monitor those things so you know when they deteriorate. And speaking of monitoring, we noticed that monitoring P95 and P99 didn't give us the whole story. So we also started to monitor the slowest request, just to get an idea. If you have a huge volume of requests, say 100,000 per second, and a few hundred of them are really, really slow, the P99 will not reflect that very well, because they're blended in with everything that's fast. So nowadays we really look at the slowest request: what was the maximum in a specific timeframe? It gives you an indication that something is wrong, because even when the number of timeouts is small, the application people might complain, and it's good to have that number to verify whether it's us or not us, this database or not this database. Don't measure just P99; nowadays, also measure the max latency.

And now our recommendation, the thing I hinted at: you're basically in a race. It's not just you trying to fix things; in the cloud there's a VM team, there's a host team, there are all kinds of teams, and you're basically in a race with them. You can say, okay, I'll give them a chance, I won't fix it myself, and in most cases they fix it and things are back to normal after a certain amount of time. Or you can try to redeploy to a different host, and that might get you out of the situation, but it takes time, and sometimes they're faster than you. It's a race: sometimes they're faster, sometimes they're slower, so you're taking a risk when you start to redeploy. You can't just stop halfway and say, oh, they fixed it, I'm not redeploying anymore; once you're on that path it takes you, let's say, five minutes, and they might have fixed it in one or two minutes. But you never know; you never know whether they'll fix it or not. So maybe redeploying gives you the predictability you're looking for.

Anyway, since this will be part of the online portion: thank you! If you have questions, reach out to me. I'm on the Cassandra Slack; you can definitely find me there. Hope you learned something, and thank you a lot. Talk to you later.