Hi everyone! I'm very happy to be here at KubeCon Europe to talk about everything we at OVHcloud learned operating some large etcd clusters.

Quick words about me. My name is Pierre Zemb. At the time I'm recording this, I'm a technical leader at OVHcloud, where I work on distributed systems. I have experience with things like HBase, Hadoop, Flink, Kafka, Pulsar, FoundationDB, and now etcd. Back in my city, Brest, I'm also involved in two local communities, one for developers and one to help kids learn to code. By the time you are watching this, I will be starting a new position next week at Clever Cloud, a European platform-as-a-service company, so feel free to check them out.

Here's our schedule. First, I will introduce etcd and its place in the Kubernetes ecosystem. Then I'll talk about how we built our managed Kubernetes product and the implications of its design for etcd. Then we will move on to everything we learned from production, from observability to configuration, plus tips and tricks.

Quick words about OVHcloud. We are a global cloud provider with 31 data centers around the world. We host private cloud, public cloud, dedicated servers, web hosting, and so on. And we have a certified managed Kubernetes offering, if you are interested.

Let's get started. So, what is etcd? etcd is a strongly consistent, distributed key-value store. The name stands for the /etc folder in Linux, where you put all your configuration files, but in a distributed fashion. And it's a CNCF graduated project.

Where does etcd fit in the Kubernetes ecosystem? Here we have a full cluster. On the top, we can see all the control plane components: the cloud controller manager, the controller manager, the scheduler, and the API server. On the bottom, you have the nodes where your pods are running. etcd is right here. The reason there is this nice brain drawing is that etcd holds the entire state of your Kubernetes cluster: every event, every pod, every custom resource definition, everything is stored in etcd. The way etcd is built allows the Kubernetes ecosystem to be reactive, thanks to watches (there is a small sketch of one at the end of this introduction). And only the API server is allowed to talk to etcd.

I mentioned earlier that we have a managed Kubernetes product at OVHcloud. To build it, we did something that we call Kubeception: we use Kubernetes to deploy Kubernetes. So here we have a standard Kubernetes cluster, but in our case, instead of running WordPress, Kafka or any other application, we run the control planes of our customers in it. Here, in yellow, you have customer one with its API server and so on, and here, for example, customer two.

The control planes are handled; now we need to worry about the data. Instead of deploying thousands of etcd nodes, we decided to mutualize them and use one huge etcd cluster for hundreds of customers. This design puts a lot of constraints and stress on etcd, so we are forced to shard across dozens of etcd clusters, each supporting up to thousands of range requests, hundreds of transactions per second, and thousands of messages sent through watches. You may think these numbers are kind of low compared to what you can see in any benchmark, but when you run those three workloads together, you hit some performance issues, which are tracked in upstream GitHub issues. So, now that we have the context, let's dive in.
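Since watches are the mechanism that makes all of this reactive, here is a minimal sketch of one, using the official Go client (go.etcd.io/etcd/client/v3). This is my own illustration rather than code from the talk: the endpoint is a placeholder, and /registry/pods/ is the key prefix the API server uses for pods.

```go
// A minimal sketch of an etcd watch with the official Go client.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Watch every key under /registry/pods/, the prefix the API server
	// uses for pods. Each mutation is pushed to us as an event; on a real
	// Kubernetes etcd the values are protobuf-encoded API objects.
	ch := cli.Watch(context.Background(), "/registry/pods/", clientv3.WithPrefix())
	for resp := range ch {
		for _, ev := range resp.Events {
			fmt.Printf("%s %q (%d bytes)\n", ev.Type, ev.Kv.Key, len(ev.Kv.Value))
		}
	}
}
```

Every mutation under the prefix is pushed to the client as an event, which is exactly why controllers can react to changes instead of polling etcd.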
So first, we need to enable observability. There are two flags that you need to enable on each member of the etcd cluster. The first one is the metrics flag, which you can set to extensive to gather as many metrics as possible. The second one configures the logger, so that we have all the traces we need to debug in production.

Now that observability is enabled on the etcd cluster, you may wonder: what do I need to watch? From an SRE point of view, we can divide etcd into four layers. The first one is gRPC. gRPC is used for all communication; it's also a CNCF project that you can use to develop and expose remote procedure calls, or RPCs. The second layer is called Raft, and this is the core part of etcd: Raft is the consensus algorithm used to make sure that all the nodes agree on the data. Then there is the storage part, which can be divided in two: a WAL, the write-ahead log, used for recovery and for ordering operations, and a key-value store called bbolt.

Okay, so let's get started and observe gRPC. I've highlighted a few metrics that are important to watch, things like the total number of RPC calls and the number of bytes in and out. But perhaps the most important one is the peer round-trip time, etcd_network_peer_round_trip_time_seconds. etcd is a distributed system, so you need to pay attention to the quality of your network, and this metric shows you the health of the network between the etcd cluster peers. So watch out for the network with this metric.

Let's dive into Raft now. I won't go into detail about Raft, because I think it deserves its own talk, but I will highlight a few items. First, you need to know that in Raft you have a leader and followers. The leader accepts writes, while reads can be served by any node in the cluster. So if you want to write data into the database, you need to have a leader, which is why the first metric to check is etcd_server_has_leader. The second metric, etcd_server_leader_changes_seen_total, is important to follow because it shows how often your leader changes places: if your leader keeps moving from one node to another, it means something is going wrong in your cluster. You can also use the command-line tool to ask who the leader is. I really like the table option here, because you can easily see a lot of information at once: the version, the database size and some Raft details.
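For illustration, here is a rough Go equivalent of that table view, built on the client's Maintenance API. This is my own sketch, not code from the talk; the endpoints are placeholders and the output format is improvised.

```go
// A rough Go equivalent of `etcdctl endpoint status`.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	endpoints := []string{"etcd-1:2379", "etcd-2:2379", "etcd-3:2379"} // placeholders

	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()

	for _, ep := range endpoints {
		// Status reports, per endpoint: server version, on-disk DB size,
		// the member ID of the current leader, and the Raft term.
		st, err := cli.Status(ctx, ep)
		if err != nil {
			fmt.Printf("%s: unreachable: %v\n", ep, err)
			continue
		}
		isLeader := st.Header.MemberId == st.Leader
		fmt.Printf("%s version=%s dbSize=%dB raftTerm=%d leader=%v\n",
			ep, st.Version, st.DbSize, st.RaftTerm, isLeader)
	}
}
```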
Another aspect of Raft that is important to understand is that the leader sends heartbeats to the followers, and sometimes the followers take too much time to respond. 90% of the time, it's because of a slow disk or the network. So you need to watch for heartbeat failures (etcd_server_heartbeat_send_failures_total) because, like the round-trip-time metric, they tell you that something is going on in your cluster.

So, now we have a cluster with an elected leader, and we can add new data. The process of adding new data is called a proposal, and because Raft is a replicated state machine, a proposal moves between different states, from pending to committed to applied, applied meaning that every follower has applied what the leader committed. You need to pay attention to three metrics here: etcd_server_proposals_pending should be as low as possible, there should be no gap between etcd_server_proposals_committed_total and etcd_server_proposals_applied_total, and of course you need to watch etcd_server_proposals_failed_total.

The Raft engine also includes some tracing capabilities: every call that takes too much time, 100 milliseconds by default, is logged together with all its steps. Here we have an example where we are trying to read one value and it's taking way too much time. The tracing part is really interesting because it shows you exactly which step is taking too long.

Of course, Raft and etcd also expose metrics when calls are slow. There are three important metrics for slow events: the first is for writing data, etcd_server_slow_apply_total; the second is related to reading data from a follower, etcd_server_slow_read_indexes_total; and the third is related to watches, etcd_debugging_mvcc_slow_watcher_total. You need to be careful with those three.

Okay, so we talked about the Raft engine. Let's talk about storage, and we will start with the write-ahead log. One thing you need to know is that every mutation, every proposal, is written to disk: etcd waits until each mutation is fully persisted before acknowledging it. That is why you need to use fast disks. An SSD is the very least you must use, and NVMe really matters here. And please don't use any network-attached volume, because you will get poor performance out of it. To check how well you are doing, there is a nice metric, etcd_disk_wal_fsync_duration_seconds, which gives you the latency distribution of fsync, the action of syncing what you just wrote to the WAL. On our side, in the managed Kubernetes team, we use VMs called IOPS, which have some nice NVMe drives designed for databases, so perfect for us.

Let's move on to bbolt, the embedded key-value store used by etcd. One thing you should know about bbolt and etcd is that the default storage limit is set to 2 gigabytes, and the maximum suggested size is 8. I recommend setting it to 8 right away (via --quota-backend-bytes), and there are two metrics that let you keep track of the total size of your database: etcd_server_quota_backend_bytes, the quota, so the maximum, and etcd_mvcc_db_total_size_in_bytes, the current database size.

We went through a lot of things about observability; now let's talk about some tips and tricks. Since etcd allows clients to be notified of every modification, etcd needs to keep an exact history of its key space and all the mutations applied to it. As you may imagine, this history needs to be periodically compacted, otherwise you would simply run out of disk. This process is called compaction, and it can be automated directly by the database or controlled by the clients. As we have hundreds of API servers on one etcd cluster, we decided to let the database handle it, so we run a periodic compaction every hour (--auto-compaction-mode periodic with --auto-compaction-retention 1h).

One thing you should know about bbolt is that it opens its files with a special option: memory-mapped files. This means the underlying operating system will cache the files in RAM, so if you are using 8 gigabytes as the data limit, bbolt itself may take up to 8 gigabytes of RAM. Please do not cap etcd's memory below the maximum data limit.

Another operation that needs to run on an etcd cluster is called defragmentation, and it's linked to how bbolt works internally. After the key space is compacted, bbolt may exhibit internal fragmentation, and to reclaim that space you need to run a defragmentation. It can be run on the whole cluster or locally; in our case, we run it locally, in a rolling fashion.

One important aspect of operating databases is, of course, backups. We used different tools and approaches throughout the years, and now we use the embedded tool directly; it's based on the snapshot feature in Raft. One thing you should know is that snapshots are local, and it is better to trigger them on a follower, because you do not want to put extra load on the leader.

One really nice thing about etcd is that you can easily change the cluster size. For example, if you have a failed machine, you can easily say: remove this faulty member that has gone dark, and here's a new member that needs to be synchronized. We built some automation to ease the process, because removing the faulty member and adding a new one is kind of tricky, you need to pass member IDs around, and we prefer to smooth out on-call duty. Everyone is really happy that this side is fully automated.

A last tip, not about etcd clusters but about etcd clients. Grafana has a nice blog post on how a production outage was caused by a bad etcd client setup. I must admit that we had the same issue, so please enable the two keepalive parameters shown below if you are using etcd directly, because by default the client performs no health checks on its connections to the cluster, and everything can go really bad if you are, for example, moving the cluster around. So please enable those two parameters.
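To make those two parameters concrete, here is a minimal sketch of a client setup with keepalives enabled, again with the official Go client. The two DialKeepAlive fields are the important part; the durations themselves are illustrative assumptions, not values from the talk.

```go
// A minimal sketch of an etcd client with gRPC keepalives enabled, so that
// dead connections are detected and the client fails over to another member
// instead of hanging on a gone-dark endpoint.
package main

import (
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func newClient(endpoints []string) (*clientv3.Client, error) {
	return clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,

		// Ping the server over an idle connection every 10 seconds...
		DialKeepAliveTime: 10 * time.Second,
		// ...and declare the connection dead if no answer arrives within 5,
		// so the client moves on to another member. Both are off by default.
		DialKeepAliveTimeout: 5 * time.Second,
	})
}

func main() {
	cli, err := newClient([]string{"etcd-1:2379"}) // placeholder endpoint
	if err != nil {
		panic(err)
	}
	defer cli.Close()
	fmt.Println("client configured with keepalives")
}
```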
We went through a lot of things across this talk, so let's do a quick recap of everything we learned. First, the golden rule is to observe everything, from gRPC to I/O latencies; you definitely need to triple-check every latency, from the disk to the network. You should raise the storage size limit to 8 GB. A quick tip for anyone running a very large Kubernetes cluster: you can divide and shard over multiple etcd clusters. Run your etcd clusters on dedicated machines with SSDs or NVMe. Don't be stingy on the RAM limit. You need to run compaction and defragmentation on all members. Don't forget to do backups. Automate membership changes for on-call duty; your SRE team will thank you. And if you are using etcd directly, please enable those two client keepalive flags.

That's it for today. I hope you learned some stuff. The real me should be around to answer all your questions, and don't forget to come to our virtual booth. I wish you a happy KubeCon, and stay safe. Bye!