Hi folks, let's get started. Today my teammate Bogdan and I, Vivek, are going to talk about mastering etcd observability. Let's get into it. First, there's a prerequisite we'd recommend you go through: just something to download, because some of it is Docker Compose images and the internet isn't really stable right now. So if possible, scan the QR code or follow the link, and then we'll get started. I'll give it a minute; once we're done, we can move forward. Okay, Bogdan is going to help you out, so raise your hand if you have a problem or anything like that. We strongly recommend you follow along while we go on this journey; it'll be fun.

Okay, moving on. To review what we're going to learn today: the fundamentals of etcd; leader election in general and how it impacts etcd; the architecture, how everything is laid out and how everything interacts; and then the meat of it all, metrics: the fundamentals of how they're structured, how you can access them, look at them, and make sense of them. After this brief intro we have a lab session, where we run through a few scenarios of problems you might run into and which metrics are really important to look at in each of them.

So let's take a walk. The basic, fundamental question you'd ask is: what is etcd? etcd is just a simple key-value store. It's distributed in nature. What does that mean? It means the storage layer is split up among different computers rather than sitting on one single computer, so all the computation and storage happens in a distributed fashion. The next thing is replication: etcd replicates data across all the members of the cluster. etcd is a cluster, and it has multiple members within it. Then consistency: getting and putting data happens in a consistent fashion, which is essential for a database. And finally, highly available: as long as etcd has quorum, you're good to go. Say etcd has five members in the cluster and one of them goes down; you'd still be able to read and write data, and everything would be fine. Losing a node does not mean etcd goes down. That provides resiliency, and that's what we mean by highly available.
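To make the "simple key-value store" part concrete, here's a minimal Go sketch using the official go.etcd.io/etcd/client/v3 package. The endpoint addresses are an assumption for a typical local three-member cluster, so adjust them to whatever your setup exposes:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// List every member's endpoint so the client can load-balance
	// (more on that later in the talk).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379", "127.0.0.1:22379", "127.0.0.1:32379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// A put is replicated to a quorum of members before it returns.
	if _, err := cli.Put(ctx, "greeting", "hello"); err != nil {
		panic(err)
	}

	// A get is linearizable by default; we'll see what that costs later.
	resp, err := cli.Get(ctx, "greeting")
	if err != nil {
		panic(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s (mod revision %d)\n", kv.Key, kv.Value, kv.ModRevision)
	}
}
```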
So let's understand leader election. Leader election here is done by Raft; Raft is the fundamental algorithm etcd uses to perform leader election, and leader election is important because it helps keep the data consistent. Keep in mind there are two concepts: a leader and a follower. The leader is the one that makes sure everyone is in sync, and the followers are the other members of the cluster.

Now, the election itself. You can see there are five members in the cluster, and there's a gray bar counting down: that is the election timeout. When this timeout expires, a member sends a message to everyone. This is a candidature message, which means: hey, I'm a candidate to become the leader. Once everyone confirms, like, hey, I got your message and I'm good to go, they accept the candidacy and agree that this one member should be the leader. In this case, S1 becomes the leader. The second timeout is the heartbeat: once the election is over, the leader periodically keeps sending messages to the followers to check whether they're healthy and still within the cluster. So this helps us understand how etcd operates among its members.

The next item is architecture: how etcd is structured in general, and the main components you should be looking at when thinking about etcd. One is the server side, which is essentially the API you access. Raft is the algorithm etcd runs on top of, as we saw on the previous slide about leader election. The next one is MVCC, multi-version concurrency control. Don't worry about these terms; we have a glossary in the GitHub repo in case you need to review them. To put it simply, MVCC tracks all the revisions of your keys. Say you had a key and updated it 50 times: MVCC stores all the revisions of that key, basically every update you've made to it so far. The next thing is the client, which is just the machine you'd use to access the etcd server. Then BoltDB: BoltDB is the backend etcd uses to store data. It's a very simple database, and that's its feature; simplicity is what BoltDB brings to the table. And there's the write-ahead log, which is a log of all the transactions committed so far, stored on the server.

Now, this is how all the pieces interact. If you start your journey from the client, you hit the gRPC server, and then you can think of the etcd server as the central brain of the whole architecture: it connects to the different components, and this is broadly how we think about etcd as we go through the different pieces.

Now let's jump into a few operations. That was the reference diagram we saw before, and now we're going to see how etcd performs the operations we mentioned. We'll jump into read and write, but first some brief background. A transaction is just a combination of reads and writes: you'd use it when you want to do two things in one go, or maybe multiple writes in one go. And a watch: if there's a key you want to track updates on, then whenever the key gets updated, you get an event back saying, hey, the key got updated to this value, and so on and so forth. So for as long as you're watching the key, it keeps sending you updates as the key changes.
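Here's a minimal sketch of what a watch looks like with the Go client, reusing the cli from the sketch above:

```go
// Watch a key; the returned channel delivers an event every time the
// key changes, until the context is cancelled.
ctx, cancel := context.WithCancel(context.Background())
defer cancel()

for watchResp := range cli.Watch(ctx, "greeting") {
	for _, ev := range watchResp.Events {
		// ev.Type is PUT or DELETE.
		fmt.Printf("%s %s = %s\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
	}
}
```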
So this is the read flow diagram. Let me zoom in a little; that might be easier. Okay, let's walk through a request. You send a request, it hits the gRPC server first, and that gets forwarded to the etcd server loop. From there, in the third step, you get a read index. What does that mean? A read index, in general, means the member talks to the leader and gets the latest revision: the latest version the database was updated to when the request came in. And this is important for consistency, as we discussed before. Once it gets the read index from the leader (sometimes this member might be the leader itself), it finds the latest KV pair, which is just a key-value pair, and responds with the data. So that's a simple read operation. It's simplified; there's a lot going on, obviously, but this is a very high-level view for the sake of this presentation.

Now let's look at the write flow. This is a little more complicated than the read. Let's look at the components it walks through. Similarly, you put in a request; it goes through the gRPC server and hits the etcd server loop. Then there are two or three steps, as you can see: there's a replicate step, which goes through the peers, and a persist step, which persists the transaction to the write-ahead log. What happens is, once the server gets the request, it goes through Raft. Raft makes sure quorum is met, in the sense that at least a majority of the members of the cluster agree: yeah, this looks good, we're good to go. Once Raft agrees that the transaction is committed and most of the members actually got it, the member persists it itself, and it's also sent to all the other members so the data is replicated. So if one member goes down, you still won't lose the data; it's also what gives you consistency. Once that's done, the change is applied to the MVCC store and a response is sent back to you. And keep in mind, it also syncs to BoltDB asynchronously, every hundred milliseconds or so, so that it doesn't queue up the disk as much; it's simply cheaper to batch transactions this way rather than committing them one at a time. So that was a simple write flow.

Now let's get into the meat of what this presentation is about. To recap, we've seen a primer on what etcd is, the kinds of operations it performs, and leader election. Now, metrics. etcd generates Prometheus-format metrics and exposes them on its metrics endpoint, which you can then scrape. If you hop onto a host and hit /metrics, you should see all the metrics that member is emitting at that point in time. Next, the detail level: you can set it to basic or extensive. Basic means you get a smaller set of metrics, and extensive gives you everything that exists; if I remember correctly, there are about 126 metrics emitted at any point in time. The next point: avoid etcd_debugging. Metrics under the etcd_debugging namespace are unstable, or in flux right now; they're still being developed, so they might not be the best to rely on. Avoid them for now. And namespacing just means that etcd metrics are namespaced by the module they belong to: for MVCC you'd see etcd_mvcc_ followed by whatever the metric name is, and so on and so forth for the server and any other module you can think of within etcd.
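If you're following along, here's a small Go sketch that scrapes a member's metrics endpoint and filters for one namespace. The address is an assumption based on a default local member:

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Scrape the /metrics endpoint of a single member.
	resp, err := http.Get("http://127.0.0.1:2379/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Print only the MVCC-namespaced metrics.
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		if line := scanner.Text(); strings.HasPrefix(line, "etcd_mvcc_") {
			fmt.Println(line)
		}
	}
}
```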
Now, we're going to focus on these three components for most of the presentation, just to limit how much of etcd we expose you to. The idea is that these three components, with their respective operations, will be the focus area. If you haven't already, scan this QR code; it's the same one from the beginning, in case someone came in late and wants it. Just making sure everyone's set. Bogdan is going to take you through the lab from here. If you have any questions, raise your hand; I'll be walking around in case you need help, so just holler at me.

Okay, thank you, Vivek. Hello folks. We're going to jump into the lab portion of our presentation. I hope everyone had a chance to check out the repository and run the prerequisites; I'm going to do it together with you right now. I already have the images downloaded, so it'll be faster for me. First order of business: we bring up our etcd cluster. What we're doing here is bringing up a three-node etcd cluster plus Prometheus and Grafana services. We'll use Prometheus to pull metrics from etcd, and Grafana for our dashboards. I know there was a question about the logs and some of the errors here: ignore the Grafana errors, that's okay.

Now, to verify that everything's working, we'll try the benchmark; I'll talk about it in a bit. Okay, what I just did: I ran the etcd benchmark that comes as part of the etcd repository. For the purposes of this demo, I built a custom etcd image that includes the benchmark and a couple of other tools, because the release image obviously doesn't ship the benchmark tool. What we're doing here is a put benchmark: we give the command we want to execute, we provide the endpoints our clients will connect to, and then some information about the type of puts we want to issue and the total number of puts. The nice thing about the benchmark is that it also prints a summary, but we'll mostly be looking at the dashboards for our lab.

Okay, next step: let's check out the default etcd dashboard. Some of you might get a Grafana login prompt; use these credentials if you do. Let me do a quick review of this default dashboard. This is the dashboard linked in the documentation, and you can download it from the Grafana website. I'm not going to dive into every area here, just a quick glance, and then we'll go into the scenarios; hopefully you'll get more exposure to the various metrics through those. So: etcd uses Raft. Raft is the consensus protocol, and it has the notion of a leader, which is very important. Of course there's a big message saying we have a leader, because if we don't, things are really bad. Next, the RPC rate. etcd is a gRPC server, so this is one of the first metrics to look at: a summary of the rates of all requests. We did some puts, so that's why we see an increase here; and after the benchmark stopped, nothing is happening right now. Then we have active streams, which covers watches and leases; we won't have time to go into watches and leases during this demo, unfortunately. Then we have some information about the database.
This is the bbolt database size, the DB size. Notice there's no data here: this is actually a bug in this dashboard. I didn't fix it; I wanted to show you what's going on. The problem is that this metric is a little old: the dashboard uses the debugging metric namespace, but the metric has already graduated to a non-debugging name. If I change that, it actually works. Okay, then we have the disk fsync durations; I'll talk about these later. This is important, especially the WAL fsync. Then memory. Then our client traffic in and out; this is from our benchmark tool. Then peer traffic in and out; this is between the etcd members. Then we have information about proposals, which we'll dive into in the next scenario. The important point is that Raft has the notion of a proposal, and it's a crucial component: every time you replicate a log message through Raft, you submit a proposal for it, so keeping track of these is important. We have proposals committed, proposals pending, and also proposals applied; I'll explain the difference between applied and committed when we get to the next scenario. And then some disk operations and network.

Okay, let's jump into our first scenario. I called it scenario zero because it just exposes us to some basic etcd operations, puts and ranges; we're not going to stress the system significantly here. First thing, let's look at the sequence diagram for puts. I know Vivek had a diagram for the etcd architecture and the write; mine is a little more complicated, and I know it has internal etcd details. But unfortunately, to understand how the metrics work and what they mean, you have to dive into some of the etcd internals. Okay, I hope this is visible.

Let's walk through the put sequence diagram. etcd is a gRPC server, so the first order of business is to accept the gRPC request; there's some boilerplate here. Then it goes into the etcd server. The etcd server is kind of a workhorse: it does a lot of coordination and translation between gRPC and the various internal layers of the system. Right away we increment the started-total metric, which is part of the gRPC middleware; it comes with the Go gRPC implementation. Next, since Raft doesn't have a notion of a put (we can't just push a put through the Raft protocol), we have to operate in proposals. That's what the etcd server does: it translates the put into a proposal and submits it to Raft to run the log replication. Right when we submit the proposal, we increment proposals pending. Now, the Raft portion of etcd: Raft is a standalone library at this point; fairly recently, I think maybe half a year ago, etcd refactored Raft into a separate repository, and this raft node in the etcd server is the implementation required to run Raft inside etcd. So when we submit the proposal, Raft does its magic, communicates with the peers, and then you get this async ready channel. A Raft implementation is required to persist the information to storage, and that's what step six is doing here: saving the entry to the WAL. WAL stands for write-ahead log. This is a very important, crucial piece of the equation, because every time there is a proposal, each member saves it to the WAL. And every time we save, we fsync. This is, I think, a little different from most databases; usually databases don't fsync on every write. But for Raft correctness you have to fsync, persist the data, and guarantee that persistence. So we fsync, and we record the duration in the WAL fsync duration metric. This is one of the most important metrics, in my opinion, because if it's slow or spiking, you'll have obvious performance issues with etcd.
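Conceptually, the instrumentation around that fsync is just a Prometheus histogram observed around the sync call. Here's a simplified sketch of the pattern (not etcd's actual code; the bucket choice here is an assumption):

```go
import (
	"os"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// A histogram in the etcd_disk namespace, similar in spirit to
// etcd_disk_wal_fsync_duration_seconds.
var walFsyncSec = prometheus.NewHistogram(prometheus.HistogramOpts{
	Namespace: "etcd",
	Subsystem: "disk",
	Name:      "wal_fsync_duration_seconds",
	Help:      "The latency distribution of fsync called by WAL.",
	Buckets:   prometheus.ExponentialBuckets(0.001, 2, 14),
})

func init() {
	prometheus.MustRegister(walFsyncSec)
}

// syncWAL fsyncs the WAL file and records how long the fsync took.
func syncWAL(f *os.File) error {
	start := time.Now()
	err := f.Sync() // on Linux this ends up as fsync/fdatasync
	walFsyncSec.Observe(time.Since(start).Seconds())
	return err
}
```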
Okay, we're done saving our data to the WAL, and when that's done we mark the proposal as committed. So "proposal committed" means it's been saved to the WAL. Then the flow goes back to our etcd server, I believe through a channel; there's a bit more going on in the etcd server, I'm simplifying here. The etcd server then calls apply, and this is the second part of our put. We've saved to the WAL, and now we need to save to the backend: to bbolt and to our MVCC. So it goes through the applier, which dispatches to MVCC, the multi-version concurrency control subcomponent of etcd. What MVCC does is maintain the history of your changes: Raft itself has no notion of history, just a log, so MVCC creates a revision for each change to a key and maintains an in-memory index between revisions and keys. That's where we put to this in-memory index, and there's some logic here to get the revision. Then we persist to the backend; this is the unsafe sequential put, and the backend abstraction. The backend for etcd is the bbolt database, an embedded memory-mapped database inspired by LMDB.

Notice that after we put to the backend, we don't actually commit. The control flow goes back to the etcd server and we return: we mark the proposal as applied, decrement proposals pending, and respond to the client. Part of the response is the revision assigned during the MVCC portion, which can be useful in various scenarios in your own work. The backend then commits asynchronously, every 100 milliseconds or so. There's also a buffer, so if it reaches a certain number of operations, a thousand or so, it will commit. And then we record the backend commit duration seconds metric. Again, important, but in my opinion not as important as the WAL fsync.

Okay, that was the sequence diagram. Now let's put it into practice and run the little benchmark for puts. I increased the total flag just to keep the benchmark running so it won't exit right away. And we've prepared a dashboard for you specifically focused on puts; I'll collapse the sections to show what we've got. Similar to the sequence diagram, I tried to group it by etcd component: there's the gRPC server, the etcd server, the WAL, MVCC, and bbolt. Of course, for your production dashboards you'll probably want to arrange things differently; it's hard to recommend the best arrangement because people have different SLOs, different hardware, and so on. It's also a good idea to build your own dashboards so you actually understand what things mean, rather than just taking the default etcd dashboard. Okay, let's dive in. We have our gRPC section. We're doing some puts right now, so the started-puts counter is incrementing; this is the rate, and we're actually generating quite a lot of load.
This is running in a Docker container, so the fsync is, I think, actually pretty good, considering it's writing into a VM of some sort. Then we have handled puts: the puts that have already completed. Notice we also display error codes here; everything's okay right now. We also have the 90th-percentile latency for puts. We're looking at 60 milliseconds, which is pretty good for the amount of load we're generating, in my opinion.

Now let's jump to the etcd server section. This is about proposals. As I mentioned, the proposal is the central concept in Raft. We have proposals pending: the ones that went to Raft but haven't been committed and applied yet. A proposal can be committed but not yet applied and it still counts as pending. We're climbing toward 300 here, queuing up some proposals, but overall we're doing well. Then we have the committed rate: proposals that have been written to the WAL. When Raft replicates a proposal and it comes back to the etcd member to be persisted, it's written to the write-ahead log and marked as committed. Then we have proposals applied, meaning applied to our backend, bbolt. It's sometimes important to watch the difference between committed and applied: if that difference reaches a certain threshold, etcd stops accepting new proposals, which would be pretty bad for everyone. We have one scenario, an optional one, where we try to trigger this. Then proposals failed: everything is good right now, we're not doing anything crazy to our etcd, so nothing is failing. Then slow applies. This gets incremented when the apply portion of a proposal takes longer than, I believe, 100 or 300 milliseconds, and something also gets written to the logs about the slow apply.

Then the WAL section; this is the important stuff. This is how much we write per second, and here's the crucial metric, the WAL fsync duration. We're doing 15 milliseconds, which is okay; I've seen much better, and I've seen worse, but for the traffic in this demo, I think it's fine. etcd's documentation has a special section about performance, with a link to an article on how to benchmark your disk and which parameters to use for the fio tool, I believe. That's pretty useful if you're serious about this; I recommend running that benchmark against your disk. And this one is just the fsync count.

Okay, we're done with the WAL. For MVCC we only have one metric here, puts per second: the puts going into MVCC. And for bbolt, we have commits per second and database size. Notice that commits per second is around 10, much lower than the WAL fsync count; that's because we commit in batches. And we have the database size. Notice it's growing right now: we have a scenario where we actually hit the limit, which by default is two gigabytes, and we'll talk about what happens then and how to avoid it. The two metrics that matter here are the database total size and the database size in use; when we get to the compaction section, I'll explain how the in-use number works. And we have the backend commit 90th percentile, similar to the fsync one. Okay, so that's our put.
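If you'd rather alert on these numbers than eyeball dashboards, the panels boil down to PromQL, like a histogram_quantile over the fsync buckets. Here's a sketch using the Prometheus Go client API; the Prometheus address is an assumption for this compose setup:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://127.0.0.1:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	// p99 WAL fsync latency over the last 5 minutes, per member.
	query := `histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))`

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```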
Now let's look at the ranges. This sequence diagram is shorter. Again, etcd is a gRPC server, so the first order of business is to accept the gRPC request and forward it to the etcd server. For puts we were submitting proposals to Raft, but for a range we're not submitting anything to the log; we're not proposing. However, etcd guarantees linearizability of operations, including ranges, by default. For that, the etcd server has to communicate with Raft and get the so-called read index, to make sure it's in sync with the other members. There's quite some complex logic here to get this read index; it retries multiple times if it can't, and if this is slow, the slow read index metric gets incremented. So that's important for range performance. When we issue the read index we still get a notification on the ready channel, similar to puts, but instead of saving to the WAL, we just write to another channel to notify the etcd server. When that dance is done, we can go get our data from the backend. This part doesn't involve any other members; it's local to the node. We go into the so-called txn package and call range. Again we hit the MVCC component, which is responsible for keeping track of revisions. One interesting bit: MVCC queries the index for the revisions that correspond to the keys in the range, and then, for each of those revisions, it issues an unsafe range to the backend. This sometimes confuses people: you might think there's support for ranging directly over bbolt, so why can't we just grab everything in one optimal pass? But because of the revisions, the range basically translates into a get for each individual revision. Then we increment range total and return to the client. And similarly, we have the gRPC metrics here: handled total and handling seconds. Actually, while preparing this demo I noticed we don't have a timing for just this range portion, so I filed an issue with etcd to add a metric here; I think it was simply missed. It would be quite useful to see how long just this part takes, without the Raft part.

Okay, let's run the benchmark for ranges. My put benchmark is still running; I'm not going to stop it, I'll just do the ranges at the same time. We've prepared a ranges dashboard that follows a similar structure to the puts one; all our dashboards try to follow this structure of gRPC, etcd server, MVCC, and the backend, where involved. Not that many metrics for range, unfortunately. We have ranges started, handled with codes, then the latency; I think we're doing okay-ish. By the way, here we're actually just ranging over the empty key, and we're doing it while doing a bunch of puts, which is why there's some latency. There are sent bytes. Then, on the etcd server, we have the slow read indexes; everything is working fine right now, so that's zero, and we also have failed read indexes. And on the MVCC side we have ranges total. Okay, that's the range dashboard.

Now let's jump into something a little more interesting. In the next two scenarios, we're going to try to trigger some failures and delays.
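Since we'll flip this switch in one of the scenarios, here's what that linearizable-versus-serializable choice looks like from the client side. A minimal sketch, reusing the cli from earlier:

```go
ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
defer cancel()

// Default: a linearizable read. It goes through the read index, so it
// coordinates with the leader and reflects the latest committed state.
linResp, err := cli.Get(ctx, "greeting")
if err != nil {
	panic(err)
}

// Serializable read: served from the local member without the
// read-index round trip. Faster and cheaper, but possibly stale.
serResp, err := cli.Get(ctx, "greeting", clientv3.WithSerializable())
if err != nil {
	panic(err)
}

fmt.Println(linResp.Header.Revision, serResp.Header.Revision)
```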
So let's start doing that. In the first scenario, we're going to delay the fsync used when writing to the WAL, and we'll see how that affects the performance of etcd puts. I think we're still running the benchmark; okay, that's fine. To add the delay, we're going to use the gofail library that etcd is built with. I specifically built this image with gofail enabled, to allow inserting these delays; the production etcd images obviously have it disabled, since it would probably be a security red flag. But for our demo I built with it enabled, and what we're going to do is insert a sleep at a certain predefined place in the code. This will be walBeforeSync: right before we fsync, there's a spot that gets processed by the gofail library, and we have the ability to insert the sleep while the cluster is running. So I'm going to do that.

Okay, let's check out our puts right away. We were worried we might run out of space soon; I think we're still good. After two gigabytes things start failing, but we're not reaching that yet. Let's see what's happening; let's wait a bit for the data to refresh. Okay, here we go. We inserted a 100-millisecond delay on the fsync, and right away we see the spikes, as expected. Now let's see what's happening to our put requests. We also see some slow applies, and our proposals-applied rate is dropping, because we can't keep up with the prior rate due to the fsync. And let's look at the latency. Here we go: put latency climbing up slowly. Let's give it a little bit. And the rate of handled puts is dropping. Just waiting for the latency to climb even more. So this showcases the importance of the fsync and how it affects put request latency; it's important for your production dashboards and alerts to cover the fsync.
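For anyone following along, activating a failpoint at runtime is just an HTTP PUT against the member's gofail endpoint. A rough sketch in Go; the port (22381 here) and the exact failpoint name depend on how the image and compose file are set up, so treat both as assumptions:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	// Ask gofail to sleep 100ms every time the walBeforeSync failpoint
	// is hit, i.e. right before each WAL fsync.
	req, err := http.NewRequest(http.MethodPut,
		"http://127.0.0.1:22381/walBeforeSync",
		strings.NewReader(`sleep(100)`))
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))

	// Issuing a DELETE against the same path clears the failpoint again.
}
```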
Okay, so we added a small delay and saw the impact. Now let's add a larger delay, one second, and see what happens. Going back to our puts dashboard: the latency climbed up from the previous run; the data hasn't fully refreshed yet. By the way, notice that the leader usually processes requests faster than the other members, the followers: that's because you save a round trip when you hit the leader directly. But it's not recommended to direct your client's requests at the leader. The etcd client has a built-in load balancer, so you always want to specify all the endpoints of the etcd cluster and let it balance across them. So, we're now climbing to around 500 milliseconds; that's because of the one-second delay on the fsync. But we can still keep up with the traffic; we're still getting responses. Of course our rate dropped, and in my opinion this is already reaching the level of unacceptable latency. You can see our pending proposals starting to queue up a little more here. And let's look at the fsync: yeah, as expected, it's going to one second. So the conclusion is that we can keep up with the traffic, but if you have SLOs on latency, they'll probably start firing.

Okay, let's also try running some ranges again and see what happens there, while the puts are still running. I've started the ranges benchmark; you can see it's barely moving. Let me open the ranges dashboard and see what we've got in terms of latency. Let's wait a little; so, we're at four seconds. And this, by the way, is for the empty key, so there's no data. That is unacceptable. And why? Because we see the slow read indexes. Raft is queued up with processing because the fsync is slow, and that also affects the read index. This demonstrates how a slow fsync ripples outward: you might ask why that part of the system would affect this part, but because everything synchronizes through Raft, and we're doing this read index, your ranges are also affected. Now, I can drop the linearizable requirement for our ranges and change the consistency to serializable. Let's try that. You can see we just fly through the benchmark. Why? Because at the serializable consistency level, we don't need to coordinate with the other etcd members, so we're not engaging our slow read index and can process the requests much faster. Okay, so this shows how the fsync affects the performance of the system.

Let's go to the next scenario, network delays, which demonstrates that it's not only the fsync that matters for etcd. As you can probably guess, because Raft coordinates between etcd members, network communication between those members also plays a role in performance. We're going to start another cluster with a slightly different setup. To introduce the network delays I'm not using gofail; I'm using this bridge tool, which lets me proxy traffic between etcd members. It also ships with etcd; it's not very widely used, but you can specify an Rx delay when you start it. I set it to one second, and we'll see what happens. Let's wait for it to start; there are some logs about the bridge and the traffic.

I've prepared a peers dashboard for this scenario, with the metrics I consider important for representing communication between etcd peers. I included the Raft proposals here just because they're important overall; it's always a good idea to look at how your proposals are committing. Then we have leader changes: if you see spikes here and the leader changes frequently, something is probably not working right, and the problem may be the network between your peers. Then there's the leader indicator, which just shows who the leader is at the moment. And we have the heartbeat failures. I think we talked about the heartbeat: etcd and Raft heartbeat between the members and the leader, and if a certain threshold of heartbeat failures is reached, a new election starts. Then, for peer networking, we have the RTT. Somewhat surprisingly, etcd has an internal prober that probes the other members of the cluster: it issues a get request to a special endpoint and records the round-trip time in this metric. We started our cluster with the one-second delay, so that's why this is going to one second; the prior data was very low. Then we have sent bytes and received bytes, and there are some failures. Really, this is what I wanted to show you: the indication that our RTT, our round-trip time, is already delayed.

Now let's run some benchmarks and see what happens to performance. Okay, you can see it's already failing; there are some "too many requests" errors and so on. Let's look at the puts dashboard and give the data a little time to refresh. Let me increase the... yeah, so we can't even keep up at this point.
When I was testing this demo it sometimes didn't fail right away, which was actually more interesting to watch; but of course, when you're presenting, things have to not work the way you expect. Still, I think it makes the case that we can't really do the puts while we have network delays between the peers. So the purpose of this demo is to show that not only the fsync but also the network is important for etcd performance. I'm going to stop this cluster with the network delays. By the way, how is everyone doing, is anybody following along? Any questions?

[Audience:] I have a question.

Right, so one thing to check is the network. Another thing is that other operations can interfere: for example, if you run a defrag, it can block the backend, and compaction, I think, can sometimes cause slowness too. So I would check the network first; check those RTT times, that would be my first response. But there may be something else as well. It also depends on which version of etcd you're running; I think there can be performance differences between 3.5 and 3.4. You could also be doing ranges, for example, that slow down some of the puts. Now, the apply time covers the apply portion of the flow, which is after we've committed the message to the WAL, so it's geared more toward bbolt performance. I would check your database size, for example. Right, so I think you've already moved beyond the recommended etcd size limit there. So yes, the apply time is about the backend and MVCC; that's past the Raft part, so look at the slow applies. For example, I've seen in Kubernetes that if people query all pods in a big cluster without paginating, you can get into this issue.

[Audience:] Thanks a lot, that was very insightful. At some point you mentioned that as a best practice it was interesting to have the clients pointing at members that are not the leader. Is there any built-in way in the libraries to do that?

Can you repeat that?

[Audience:] At some point, I believe I understood you to say that as a general practice it's interesting to have all the clients pointing at non-leader members. How do you implement this in practice?

In practice, in Kubernetes for example, you specify the etcd endpoints, and usually you list all of them. You plug them all into your connection string, and the client will round-robin between those endpoints. You control that, right? If you decide, okay, this endpoint is the leader, and you put only that one in, then all the traffic goes there. Then at some point the leader changes, so that endpoint is the leader no more, and your logic is flawed. So it's not recommended; just specify all the endpoints.

[Audience:] I think I misunderstood; I initially thought you said it was not recommended to send traffic to the leader. Okay, I get it now.

Right, what I'm saying is: don't decide "here's my leader, I'll just send everything there because I think I'll get better performance." That's what I'm trying to say.

[Audience:] Yeah, it makes sense, okay. Thanks for that.

Yeah, it's whatever Go exposes. So basically it's f..., is it called fdatasync or something?
Yeah, on the Linux API side I don't remember exactly what it's called, whether it's fdatasync, yeah. I think there have been some discussions about removing the fsync, but Raft doesn't really allow it; you might violate some of Raft's correctness. Another thing here is to watch other services using your disk. For example, if you run etcd on a node and you also make backups of your data directory onto the same disk, that can introduce fsync delays while the backup runs.

Okay, let's move to scenario number three, where we trigger the database size limit. I'm going to start the cluster again. That's our puts dashboard. I'm going to run the puts benchmark with an increased value size, 10 kilobytes, to trigger this condition. Let's see what happens. I'll switch to the puts dashboard and look at the database size. Okay, we're climbing; we've done 500 megabytes so far. Let's wait a little for it to reach two gigabytes and then observe. Almost there. Okay, we've reached our limit, and what happens then: etcd stops accepting new put requests and starts issuing a resource-exhausted error code. That's what I'm going to show you. In the handled puts, where we also have the error codes, we start seeing ResourceExhausted. It's a simple demo, but I feel people get confused by this sometimes. Another point of confusion: when etcd reaches the limit, it sets a so-called alarm, and to get rid of it you have to issue a disarm command; I've seen people get confused by that too. From a metrics perspective, this is what to watch for: remember what your database size limit is and set up alerts around it. It's also one of the reasons you'd see the resource-exhausted code. And here we see all these errors: resource exhausted, "database space exceeded", and so on. Simple, but important.

Okay, now, to deal with the space, what are our options? Really, the option is compaction. Since etcd maintains all the history of your keys, at some point you want to compact away the old history. I linked some docs about compaction. You can run it in two ways. There's the compaction call from the client, where the client itself decides when to compact and issues the compact call; that's what Kubernetes does, it issues a compact every five minutes. And there's the option of so-called auto-compaction, based on time; I think it's by hours, so you can configure a periodic compaction every hour, and there are some settings that decide what to compact. We're going to do the call from the client. I also have a sequence diagram; I hope these diagrams aren't too confusing with the internals. Compaction is actually, in my opinion, quite complicated in its implementation, but let's try to go through it. First, an interesting part: compaction actually translates into another proposal as well. If you issue a compact to one etcd node, it gets replicated, and then every node compacts. So, similar to other proposals, you get the same metrics, started total, proposals pending and so on, and the information about this operation is even saved to the WAL.
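For those following along, a client-driven compaction, plus clearing the NOSPACE alarm once space has been reclaimed, looks roughly like this with the Go client. This reuses the cli from earlier, and the revision handling is a minimal sketch, not a production compaction policy:

```go
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()

// Find the current revision, then compact everything older than it.
resp, err := cli.Get(ctx, "any-key")
if err != nil {
	panic(err)
}
rev := resp.Header.Revision

// WithCompactPhysical makes the call wait until the compaction is
// physically applied, instead of returning once it's scheduled.
if _, err := cli.Compact(ctx, rev, clientv3.WithCompactPhysical()); err != nil {
	panic(err)
}

// If the cluster hit the space quota, the NOSPACE alarm stays set until
// it's explicitly disarmed (after compaction/defrag has freed space).
alarms, err := cli.AlarmList(ctx)
if err != nil {
	panic(err)
}
for _, a := range alarms.Alarms {
	if _, err := cli.AlarmDisarm(ctx, &clientv3.AlarmMember{
		MemberID: a.MemberID,
		Alarm:    a.Alarm,
	}); err != nil {
		panic(err)
	}
}
```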
But the difference comes in the apply portion. The apply calls the compact method on the MVCC subsystem, which in turn schedules compaction; there's a notion of a compaction job. The job gets scheduled, and by default etcd returns to the client while the compaction is still running. You can specify a flag to make it wait, but it can be confusing that the call returns right away when you've just kicked off a compaction that can take much longer. Then the scheduled compaction runs. It compacts the in-memory index of revisions first, and we have the index compaction pause duration metric for that. It's actually defined at the etcd_debugging level, which is an indication it might change, but let's work with it for now. What it means: when we compact the index, the index is blocked, and this metric records how long that portion takes. Then, to compact the actual data on the backend, there's logic to determine which revisions to compact, and it uses the compaction batch limit: it runs multiple loops, each of compaction-batch-limit size, to do the deletes on the backend, and it records the DB compaction pause. But that pause is per cycle of the loop. It's, in my opinion, complicated; there's probably a good opportunity to refactor it, but that's how it is right now. And then we have the compaction total, which covers the whole runtime of the job.

Okay, let's try this out and see it on the dashboard. I'm issuing the force-recreate with -v to clean up the volumes etcd is mounted with, so when I stop the cluster, the volume that hit the two-gigabyte limit gets blown away. One interesting detail: the Grafana setup uses, I think it's called, external volumes, so its data persists; that's why you see the history of prior runs in the dashboard. Just a side note. Okay, let's start the cluster. Now we're going to run the put benchmark, but with compaction. Basically, it runs the puts, but it also has options to compact: there's a compact interval, which we set to 10 seconds. In production you're not going to compact every 10 seconds, but for demo purposes, to show how the metrics reflect it, we use 10 seconds. Then there's the compact index delta; that one is benchmark-only, the actual compaction command to etcd is different. So let's run this.

We have a dashboard for it, with a similar structure. Compaction is a gRPC call again, so we have the gRPC metrics for it. Let's wait about 10 seconds. Now, on the MVCC part, we have two metrics: the current revision and the compacted revision. The current one is, as the name says, the revision of the latest change in the database, and the compacted one is the revision below which history has been compacted away. There is one exception to that rule in the compaction logic: if a key hasn't changed and its revision is below the compacted revision, etcd keeps the latest value of that key. It doesn't remove a key's revision, even below the compacted revision, if it's the latest version of that key, because you don't want compaction to delete the current value of a key just because you no longer care about its history; you're only trying to clean up the history. I think that's how it's implemented; something to keep in mind.
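One visible consequence of the compacted revision, sketched with the Go client (reusing cli; the key and revision here are illustrative only): reading at a revision below the compacted revision fails, while the key's latest value survives.

```go
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()

// The latest value is still readable after compaction.
latest, err := cli.Get(ctx, "greeting")
if err != nil {
	panic(err)
}
fmt.Println("latest:", latest.Kvs)

// A revision we assume was compacted away (illustrative).
oldRev := int64(42)

// Reading at a revision older than the compacted revision fails with
// "mvcc: required revision has been compacted".
if _, err := cli.Get(ctx, "greeting", clientv3.WithRev(oldRev)); err != nil {
	fmt.Println("historical read failed:", err)
}
```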
Here we can see that, because we compact every 10 seconds, our compacted revision keeps step with the current one. Then we have the compaction latency, but again, this isn't the true latency: it's just the gRPC call, and as I said, by default the call returns right away without waiting for the job to finish. What you actually care about is this one, which you can see is around five seconds. That's kind of high, and it's because the job is looping, compacting the backend in those batches. The good part, at least, is that it's not blocking the backend for five seconds straight. Then we have the DB compaction pause, which is the pause within each iteration of the loop; that's more reasonable. Then the index compaction pause. And then the DB size. The interesting part: we are writing something right now, but you can see our DB size is not increasing (I'm looking at the latest portion of the data; the earlier part was from the prior run). Why? Because we compact fast enough to free up space, and that space gets reused. The DB total is the total database size, and the size in use is what's actually occupied by live, valuable data. When you compact, and revisions and keys get deleted, those bbolt pages move to the so-called freelist, and subsequent operations can reuse that space. So that's compaction.

Okay, there's an optional scenario that may or may not work; let me try it. Let me stop this for now. What I'm doing here: I'm writing keys with the larger value size, 10 kilobytes, but I'm also setting up compaction so it roughly keeps up. Remember, when we did this for the DB size limit, we hit the limit pretty quickly; here I'm trying not to reach the DB size limit, by compacting. Let's run the scenario. I'm on the compaction dashboard, and we have a 30-second interval set in this test, so we need to wait a little longer. Let's jump to the revisions; I think that's probably the best way to see the compaction update things. Okay, the compacted revision jumped up toward the current one. Let's give it a moment, and let's look at how long that compaction took. Let me update this. Yeah, the DB compaction pauses are increasing, and the index pauses too. Let's see how we're doing on space: we're kind of okay-ish, we're not reaching the two gigabytes. That's what I'm trying to do with this scenario: make it run longer, compacting and keeping up, and at some point we should start seeing errors. We're still doing puts with this benchmark, so let's look at our put performance and make sure it's still good. We're still doing okay here; let's give it a couple more minutes. Okay, it's still keeping up. The DB compaction pause rises to around one second, because we're compacting all those large key values, and the full compaction takes a really long time. I was hoping this would finally cause issues with our puts, but we can still keep up. As I said, this scenario is optional; it's hard to trigger the error reliably. Okay, let's move on.

Let's now jump into defragmentation. When we compact our backend, we end up with that difference between the total space the database takes and the space in use.
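You can read both of those numbers from the client too, per member, via the maintenance API. A small sketch, reusing the cli from earlier; the endpoint is an assumption:

```go
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()

// Status is per member, so you pass a single endpoint.
status, err := cli.Status(ctx, "127.0.0.1:2379")
if err != nil {
	panic(err)
}
// DbSize is the total file size on disk; DbSizeInUse is the logically
// live portion. The gap between them is what defragmentation reclaims.
fmt.Printf("total: %d bytes, in use: %d bytes\n", status.DbSize, status.DbSizeInUse)
```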
Sometimes you want to reduce that total space, and for that you run defragmentation. I'll link you to the docs; let's try it out on the cluster. Okay, I think we started getting some errors there, but anyway, I'm going to move on; I don't have much time. I'm going to write some data with a quick put benchmark, and then run the puts with compaction. Let's check out our puts dashboard and look at the DB size right away. Right here. Okay, let's wait a little. So, our DB size: because of those initial puts, we got some data in, and now we're doing just smaller puts, but with compaction. And we can see that our DB size in use, for example, is a little less than the total DB size. This is where defragmentation helps: when we defrag, the total DB size gets reduced down to what's actually in use. Let's run the defrag. We also have a dashboard for defrag. You can see we ran the defrag on etcd-3 here. By the way, defrag is not like compaction, in the sense that it's done per member only: it's not written to the log and replicated everywhere, and I'll explain why. We can see our total DB size dropped here to the DB size in use; that was the result of defragmenting this member. What defragmentation does is read all the keys and reinsert them into a new backend. It's a pretty heavy operation, and while it's doing that, it blocks all backend writes. So if your defrag takes a long time, you're going to see failures on incoming requests, and that's what the next scenario is about. But we only have 10 minutes, so I want to leave some time for questions. Anybody have any questions? Yeah.

Yeah, so it literally takes a lock on the whole backend, and it blocks writes on that member, right? It's the apply portion of the flow: the write still gets written to the WAL, so it will be committed. Raft itself still works, the fsync to the WAL still works, but the backend is blocked. Then, if your defrag runs for some amount of time, you get a difference between committed and applied (I have that metric there), and after some threshold etcd just says, okay, no more, and starts responding with, I believe, "too many requests". So to answer your question: yes, it does block the whole thing. Yeah. Actually, there was a change recently. Ideally, when a member is running a defrag, that node should not be serving traffic. The client has an internal load balancer, but it doesn't know that the node is running a defrag, so it keeps sending traffic until, say, the requests start failing; then the load balancer shifts load to other members. But there are other situations, and there's a recent change that uses the gRPC health implementation so the member can tell the client to take it out of serving traffic for the duration of the defrag. That was merged about two weeks ago, so it's probably not in a release yet, and you can still end up failing if your defrag takes a long time.
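Since defrag is per member, the client call takes a single endpoint, and you'd typically defrag members one at a time rather than all at once. A sketch, reusing the cli from earlier; the endpoint list is an assumption:

```go
ctx := context.Background()

// Defragment one member at a time; each call blocks writes on that
// member while it rewrites the backend, so don't run them in parallel.
for _, ep := range []string{"127.0.0.1:2379", "127.0.0.1:22379", "127.0.0.1:32379"} {
	dctx, cancel := context.WithTimeout(ctx, time.Minute)
	_, err := cli.Defragment(dctx, ep)
	cancel()
	if err != nil {
		fmt.Printf("defrag %s failed: %v\n", ep, err)
		continue
	}
	fmt.Printf("defragmented %s\n", ep)
}
```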
And I think that's the last scenario I had: adding a delay that tries to mimic that problem. What I did there: I added a delay to the defrag using gofail, and then, while running puts, that should trigger some errors to demonstrate it. Any other questions? Sorry, can you use the mic there?

[Audience:] Can compaction cause any performance issues?

Well, I guess that's a loaded question; it all depends on how much you're compacting and so on. So you do want to watch those metrics. One of the scenarios, the optional one, right, was trying to trigger an issue by running a compaction that took a long time, with the pauses climbing up. So yes, it can cause an issue, but usually it won't. In normal operation, if you compact every five minutes, you should be good. The point of the demo is that you should be watching those metrics just to be aware of what's going on in the system. In Kubernetes, I think the five-minute compaction works pretty well, in my experience. Yeah, I mean, I've seen people not compacting... sorry, I take that back: not defragging, in non-Kubernetes use cases. You always want to compact, because otherwise you'll run out of space pretty quickly. I'm not an expert on the Kubernetes storage layer, so I don't have a good answer for you off the top of my head. I know there's a storage layer in Kubernetes that abstracts away etcd; it has its own interface, and that's how it operates. It uses the notion of revisions, and the revisions etcd exposes are, I think, important for Kubernetes to keep track of things. For compaction, for example, I know Kubernetes keeps track of the last compacted revision, or something like that, and makes its calculations based on it.

Any other questions? Again, I hope this was useful. The purpose was really to show some failure scenarios. And again, I know this is not production; everybody has their own hardware, SLOs, sizes, and so on. But maybe this gives you a starting point: it describes areas you can look at, and maybe some of these metrics to put on your dashboards. Another area is watches, which we didn't touch at all, and Kubernetes uses them; we didn't touch leases either. But I hope this was useful. If something doesn't work, please file an issue against the repository. Also, there's another talk tomorrow from one of the etcd maintainers, who will dive into some interesting challenges of operating etcd; I recommend checking it out. Let's see what else. In terms of the metrics: if you find some metrics that aren't useful, or could be improved, or are missing, just file an issue against etcd, or you're welcome to contribute as well; I think it's relatively easy to add a metric to etcd. Yeah, oh yeah, sorry. So, in GitHub the diagrams are done with the Mermaid tool, so it's just a link; they should be there. If you click the link, it should open the Mermaid diagram. Which flowchart, mine, or the one from the beginning? Yeah, those should be in there; it's part of the readme, just a link to a Mermaid diagram. Look for the sequence diagrams. For the flowcharts, I tried to simplify: certain components are collapsed, so if you go and explore the code base, which I really recommend doing, it might be even more complicated than the flowchart; I just couldn't reasonably fit everything into the chart.
I saw it's right there in the readme, but if you open it, it should give you the... okay. Oh, okay, yeah, I got it. I saw it right here on this side. Oh, okay, yeah, sounds good, sure. Typo, probably. Okay, thank you, folks; we're out of time. On the last slide there's a QR code with a link for feedback; if you have any, please feel free to leave it. Thank you.