Hello everyone, thanks for joining. I am Prabhakar Paranival, a consulting engineer in Oracle Cloud Infrastructure. So what is in it for you? What do you get for spending the next 30 minutes here? I will touch on a few internal aspects of etcd, and I will share some of the learnings we had over the last seven to eight years operating tens of thousands of etcd clusters. As we evolved, we re-architected our etcd deployment, so we will talk about how we went about re-architecting, about migrating the etcd deployment from the legacy architecture to the new architecture, and about some of the learnings from that migration.

Just a very brief introduction. Like I said, I work with OCI, and OCI has global infrastructure with multiple regions. In each of these regions, OKE, which is our managed Kubernetes offering, is a day-one service. Multiple internal as well as external applications are onboarded to OKE, and in fact many OCI services run on top of OKE itself. With growth in adoption we were looking at operational efficiencies, and one thing we identified was to improve the way etcd was deployed and operated. Each OKE cluster has a backend etcd, and hence we needed to find a way to operate it better, to operate Kubernetes better. So we went through this journey of re-architecting, and at this point we have completed the migration to the new architecture with zero customer-reported issues. I think having a great engineering team really helped, but equal credit goes to the etcd SIG, because of the high bar they maintain with respect to patch releases and major releases. We had etcd clusters running all the way from etcd 3.3 to now 3.5.x, and we could do all those migrations with zero customer escalations. So I just want to thank the etcd SIG for the high bar they maintain.

To start with some brief etcd internals: as you all know, etcd is a distributed key-value store. I do hear that some folks run single-member etcd in production. I'd like not to believe it, but I do hear it from some of my friends. It doesn't make sense to run a single-member etcd in production; it has to be distributed. etcd uses gRPC for both its client and its peer-to-peer communication, and internally that runs over HTTP/2, in case you are interested.

Raft is the heart and soul of etcd. Many features you think of as etcd features are actually inherited from Raft. Like I said, etcd is a distributed key-value store, so you need consensus between the etcd members, and that consensus is through Raft. Raft defines, for instance, how replication of updates happens: it has to be leader-based replication. It defines how leader election is done, and so on. The way Raft works is that it treats the system it manages as a replicated state machine, and it defines how updates to that state machine are done. Internally there is a replicated log as well as the data state. An update first goes to the replicated log, which is called committing the update, and then the update goes to the backend store, which is called applying the update. Those are Raft terms inherited into etcd. Within etcd, we use bbolt for storing the state.
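To make that concrete, here is a minimal standalone bbolt sketch, just to show the kind of local key-value store each etcd member keeps underneath. The file path, bucket name, and key are made up for illustration:

```go
package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Each member keeps its applied state in a single bbolt file like this.
	db, err := bolt.Open("member.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Writes happen inside a read-write transaction.
	if err := db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("kv"))
		if err != nil {
			return err
		}
		return b.Put([]byte("/registry/example"), []byte("value"))
	}); err != nil {
		log.Fatal(err)
	}

	// Reads happen inside a read-only transaction.
	if err := db.View(func(tx *bolt.Tx) error {
		fmt.Printf("%s\n", tx.Bucket([]byte("kv")).Get([]byte("/registry/example")))
		return nil
	}); err != nil {
		log.Fatal(err)
	}
}
```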
That's why newbies might be surprised that we have a key-value store inside another key-value store: etcd as a whole is a distributed key-value store, and underneath, each member has its own standalone key-value store, in our case bbolt.

Now I'll start with how our legacy architecture looked and how our new architecture looks, and in the subsequent slides I will dive deep into the migration process, our learnings, and other things.

What you see on the screen right now is the legacy architecture. The vertical rectangular boxes on the left are the control planes of our Kubernetes clusters. Each control plane has three compute instances, and in the legacy architecture we only ran the Kubernetes containers there. What you see on the right side is our internally managed Kubernetes cluster, where we ran etcd as pods. You can think of it as a variant of the Kubeception model, where we run etcd pods on an internal cluster on behalf of the customers' Kubernetes clusters. We had an internal gateway, which was Envoy-based, and since we use TLS everywhere for communication between the kube-apiserver and etcd, we used the SNI-based routing supported in Envoy to route traffic to the right etcd store. If one of these Kubernetes clusters wants to talk to its corresponding etcd data store, it talks to the gateway, and the gateway knows which etcd to forward to using SNI. In the internal cluster we ran the CoreOS-based etcd operator for maintaining the lifecycle of the etcd pods as well as for the backup and restore mechanism.

As part of the migration, we decided to move the etcd members onto the compute instances which constitute the control planes. Like I said, we have three compute instances per control plane, so we moved three copies of etcd onto those compute instances. Each control plane talks to its corresponding local etcd; the kube-apiserver running on a compute instance only talks to the etcd on that instance. We do that for all the members, and once all the clusters are migrated, we dismantle the corresponding legacy architecture in the given region.

Now I'll talk about the improvements we made. We operated the legacy architecture for seven or eight years before deciding to migrate, so we obviously had some learnings along the way. I've listed a few of them here given space constraints; there is much more I could add.

First and foremost, we continuously patch our nodes for security updates and the like, so we had to make that less error-prone and simpler. What we did was assign a permanent identity and permanent storage to each etcd member, which is a fancy way of saying we created DNS entries and block volumes and assigned them to the etcd members. Every time we create a Kubernetes cluster, we provision three DNS names and three block volumes, bring up three compute instances, attach the block volumes, and assign the corresponding DNS entries. Next month, when I am patching those control-plane compute instances, all I need to do is terminate a compute instance, bring up a new one, and attach the same block volume and DNS entry. From the perspective of the other etcd members it's just a blip: the member disappeared for a few seconds and came back, and because the identity and the persistent storage are exactly the same, nothing has changed. This significantly simplified our operational process.
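One way to see that such a replacement really is just a blip, sketched with etcd's Go client. The endpoint is a placeholder and this is a sanity check one could run, not our actual tooling: list the members before and after patching and confirm the member IDs and peer DNS names are unchanged.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-0.example.internal:2379"}, // placeholder
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// The member ID and the DNS-based peer URL together form the permanent
	// identity; replacing the compute instance behind them changes neither.
	resp, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for _, m := range resp.Members {
		fmt.Printf("id=%x name=%s peers=%v\n", m.ID, m.Name, m.PeerURLs)
	}
}
```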
Another thing we learned from day one is that one size fits all doesn't work, because we have clusters with one worker node and clusters with thousands of worker nodes. We cannot expect operators to jump in and tune the compute resources, memory, or IOPS for the block volume, so we built in auto-tuning from day one: we continuously monitor each cluster's characteristics, and if we see that a cluster is much larger than provisioned for, we automatically scale up the compute instances, the memory, and the IOPS assigned to the block volumes.

2 GB is the default backend quota in etcd. I spoke about bbolt, the internal backend store; this quota setting, 2 GB by default, limits the size of that store. Initially, in the legacy architecture, 2 GB was sufficient, but as we evolved and grew we consistently hit that limit; the operators would get alarmed, jump in, do a defrag, and then possibly increase the quota. It was unnecessary toil. So in the new architecture we set the quota to 8 GB from day one. In fact, we are talking to the etcd SIG team about even increasing the default to 16 GB; maybe 8 GB is too small for today's cluster sizes, and 16 GB or 20 GB makes sense. Hopefully the etcd SIG will prioritize that.

Another important thing when operating etcd is the I/O latency of the disks. We have built in monitoring of etcd, and we continuously monitor the backend latency. We have alarms configured against the provisioned block volumes, and based on those alarms the operators jump in and scale up the IOPS assigned to the block volumes.

And lastly, defrag is one of the critical operations, which, as you know, momentarily pauses the member. We have more intelligent ways of defragging now. In the legacy architecture, we would wait for the quota to be hit and the alarm to be raised, and then the operators would jump in and do a defrag. Now we have incorporated automatic defrag into our code, and we do it in a much more intelligent manner, because running it in an unplanned way can take down the etcd cluster; it is, as they say, a stop-the-world operation. During the defrag, that member is completely unavailable. If you are defragging the leader, it is quite possible that the leader fails to send heartbeats to the other members in time, and you end up triggering a leader election. So we had to tune the way we do defrag; we have incorporated that, and based on our discussions with the etcd community we are tuning it even further. A rough sketch of the pattern appears after this section.

Now I'll take a couple of slides to talk about our migration process. One thing is obvious: it has to be a zero-data-loss migration, no questions about that. And it has to be a zero-downtime migration; the Kubernetes control plane has to be available throughout. Our intent is that the customer is not even aware we are doing the migration. The only thing we do that is observable to the customer is that the customer cannot delete the cluster during the migration. We typically take about ten minutes per cluster, and of course we do migrations concurrently.
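Going back to defrag for a second, here is the leader-last pattern sketched with the etcd Go client. This is a simplification under assumptions (a shared client, per-member endpoints, no pacing between members), not our production automation:

```go
package migration

import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// defragLeaderLast defragments followers one at a time and the leader last,
// so a paused member never also has to keep sending heartbeats as leader.
func defragLeaderLast(ctx context.Context, cli *clientv3.Client, endpoints []string) error {
	var leader string
	for _, ep := range endpoints {
		st, err := cli.Status(ctx, ep)
		if err != nil {
			// Don't defrag anything while the cluster looks unhealthy.
			return fmt.Errorf("status %s: %w", ep, err)
		}
		if st.Header.MemberId == st.Leader {
			leader = ep // postpone: defragging the leader can trigger an election
			continue
		}
		if _, err := cli.Defragment(ctx, ep); err != nil {
			return fmt.Errorf("defrag %s: %w", ep, err)
		}
	}
	if leader == "" {
		return nil
	}
	_, err := cli.Defragment(ctx, leader)
	return err
}
```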
To repeat, during the migration you cannot delete the cluster, but we thought that's not a major limitation, and it provides additional safety for us. We prevent deletion, but otherwise the customer can do anything they want, and they wouldn't even be aware that a migration is happening in the background.

And like I said, considering the scale of the deployment, it has to be fully automated. Operators cannot jump in and do cluster-specific operations; you set up the migration for an environment and it takes care of everything.

A couple of important points. While doing the migration, we had a choice of updating etcd to a later version, for instance 3.5, because many of the newer features would have let us do the migration in a much simpler manner. For example, there is learner mode, where you can add a new member as a learner, wait for it to catch up with the data, and then promote it. That would have simplified our job, but we wanted to stick to the best practices: for older Kubernetes versions like 1.14 or 1.15, you are supposed to use the corresponding older etcd versions, and we didn't want to deviate from that. It complicated our migration code, but we still did it that way, because we didn't want any surprises from deviating from the best practices.

And lastly, leader elections are disruptive for any software, and particularly for etcd, so we don't want to trigger frequent leader elections. If you naively migrate the leader again and again, you can end up triggering two or three leader elections during the migration. So we consciously took steps to ensure that an etcd leader election doesn't happen during the migration; the only time it happens is when you are done with the migration and are dismantling the legacy architecture. A sketch of a deliberate leadership handoff follows below.

The orchestrator had built-in capabilities for fine-grained migration. We had mechanisms to migrate within a given window, and we had concurrency control: this many clusters are migrated at a given point in time. We usually start with one or two migrations and then scale up all the way to 20. Most importantly, it automatically blocks the migration if there is any failure, because we want the operator to jump in, mitigate, and understand whether it's a region-wide issue or a cluster-specific issue. So we automatically block the migration on the first failure.

Another important thing was to run a migration canary, which is like a test migration against a test application, outside the migration window. There are so many dependencies we rely on, like block volumes, DNS, and so on, and things can change overnight. So outside the migration window we periodically run the canary, and if something goes wrong we immediately block the migration, and the SME has to jump in and confirm that the migration can be unblocked. That really helped us in many ways, because there are many new changes rolled out that we haven't validated against; if something breaks, we don't want to discover it on a production cluster. So we run migrations in a given window and run the migration canary the rest of the time.
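On avoiding elections: etcd does expose an explicit leadership-transfer API, so rather than letting an election happen by surprise, leadership can be handed off deliberately before a member is touched. This is a hedged sketch of that idea, not our orchestrator code; the MoveLeader request must be served by the current leader, so the client here is assumed to be pointed at it:

```go
package migration

import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// transferLeadershipAway hands leadership to some other voting member
// before we touch the member behind cli's endpoint. Assumes cli is
// connected to the current leader's client endpoint.
func transferLeadershipAway(ctx context.Context, cli *clientv3.Client) error {
	st, err := cli.Status(ctx, cli.Endpoints()[0])
	if err != nil {
		return err
	}
	if st.Header.MemberId != st.Leader {
		return nil // already a follower; migrating it cannot force an election
	}
	members, err := cli.MemberList(ctx)
	if err != nil {
		return err
	}
	for _, m := range members.Members {
		if m.ID != st.Leader && !m.IsLearner {
			// One planned handoff instead of a surprise election later.
			_, err := cli.MoveLeader(ctx, m.ID)
			return err
		}
	}
	return fmt.Errorf("no voting member available to take leadership")
}
```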
For each stage of the migration, which I'll show in the next slide, we had specific alarms and metrics. That way we get specific alarms with corresponding runbooks, so the operators can jump in knowing exactly what they need to do, and potentially engage the right SME, because the alarms and runbooks are very specific.

Now I'll take a few moments to show, at a very high level, how we did the migration. This is just one cluster: on the left I have one control plane with three compute instances running the Kubernetes control plane, and on the right I have its etcd cluster in the legacy environment.

The first thing we do is allocate the block volumes and DNS entries and assign them to the control plane. Before we moved to this new architecture, the compute instances hosting the control plane were pretty much stateless, in the sense that they had nothing like block volumes or DNS names associated with them. So before we even proceed with the migration, we attach the block volumes and assign the DNS entries to those compute instances.

Then we scale the etcd cluster up to five members. By default we run with three, but we want to be more resilient during migration, so we scale up to five. The black icon corresponds to the leader, just to show that we don't touch the leader until the end.

Then, before we do any mutation to the system, we take a snapshot into object storage (sketched below). We do have our CoreOS-based etcd operator taking snapshots, but we wanted one just in time, right before touching the members for migration.

Then we go about moving one member at a time into the new environment. Every time, we first ensure that all five members are healthy, and then we move the member; we repeat the same thing for the subsequent members. Once all the members are moved and healthy, we dismantle the objects for this cluster in the legacy environment. And once all the clusters in a region are migrated, we dismantle the legacy cluster that hosted the etcd pods.
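The just-in-time snapshot step looks roughly like this with the Go client. The upload to object storage is elided, and the path and endpoint are placeholders:

```go
package main

import (
	"context"
	"io"
	"log"
	"os"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-0.example.internal:2379"}, // placeholder
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Stream a snapshot of the member's backend to a local file, then hand
	// the file to object storage (upload step omitted here).
	rc, err := cli.Snapshot(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	defer rc.Close()

	f, err := os.Create("pre-migration.snap")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if _, err := io.Copy(f, rc); err != nil {
		log.Fatal(err)
	}
}
```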
Now I'll jump into some of the issues that we ran into, starting with the DNS resolution issue. I could talk about this for the next 15 minutes, but I'll try to keep it as simple as possible.

I started off with the premise that we always ensure all five members are healthy before we touch a member. Our assumption was: we are going to manipulate one member, and the remaining four are healthy, so even if the member being migrated doesn't come up, the other four are healthy and the control plane is not impacted. That was our premise. But in this case, when we started the migration of the first member, all five members started crashing. We hadn't done anything to the other members, but all of them were crashing with errors along the lines of "failed to update" and "unknown member". I need to set some context on this. As you might expect, this narrowed down to DNS; yeah, blame it on DNS. But let me elaborate on why it happened this way.

Some context first. Say I'm moving one member at a time to the new environment. The way a member in the new environment talks to the legacy environment is through the gateway. We ran CoreDNS in a wildcard mode, where it looks for a suffix, and if that suffix is present it routes the traffic to the gateway. The idea is that in the legacy environment the etcd pods don't have their own external identity; they are identified by their pod names, which typically end with svc.cluster.local. CoreDNS supports wildcard-based DNS resolution, so we had configured it so that *.svc.cluster.local resolves to the etcd gateway. When a member in the new environment wants to talk to a peer in the legacy environment, it queries CoreDNS, CoreDNS gives the IP address of the gateway, and through that it talks to the etcd pod running in the legacy environment. But when a member in the new environment wants to talk to another member in the new environment, we have the DNS names assigned, so it communicates based on those: our VCN DNS gets the query, translates it, and the members communicate locally. There is no CoreDNS involved there.

What happened in our case: say I am adding and bringing up a new member. The member builds up a member table, a key-value mapping between the DNS names of the peer etcd members and the cluster member IDs, basically the member ID assigned to each member. First it builds the table with default values, which are SHA-hash based; then it talks to the peers in the cluster and queries for the cluster ID, so that it can get the valid cluster ID, and it tries to update this member table. For that it uses URL comparison. The idea is that you don't compare the DNS names directly as strings; you resolve the DNS names, and if the IP addresses you get are the same, then the two URLs are the same. So it does a DNS resolution, sees that the two IPs are the same, decides this is the right value, and updates the corresponding entry.

But here is what happened: when the new member was coming up and trying to resolve one of its local members, the query went to the VCN DNS; under high load the resolution failed, so the resolver fell back and queried CoreDNS, CoreDNS gave the gateway IP address, and that ended up corrupting the member table maintained by the etcd member. This behavior is not well-defined across resolver libraries: glibc and musl behave differently, and the Go resolver works more like glibc.

To add more context on resolution, there are two knobs: ndots and search domains. ndots is the number of dots that should be present in a DNS name; if the name doesn't have that many dots, the resolver appends the search domains and then does the resolution. For example, say you are trying to resolve kubernetes.default and ndots is set to five. kubernetes.default has only one dot, so the library immediately knows it has to append the search domains and resolve those. That is the purpose of ndots and search domains. This is very common, and it is also the cause of many issues, because how it is handled differs from one library to another, as the sketch below illustrates.
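To make the difference concrete, here is a small Go sketch that mimics the glibc-style logic as described here: try the literal name first, then fall back to the search domains on failure. This is illustrative only, not what any particular libc actually ships; a musl-style resolver would simply skip the fallback when the name already has enough dots.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// resolveGlibcStyle mimics the fallback described above: even when a name
// has >= ndots dots, a failed lookup falls through to the search domains.
// A musl-style resolver would return the error instead of falling through.
// (For names with fewer dots, real resolvers try the search list first;
// that path is simplified here.)
func resolveGlibcStyle(name string, ndots int, search []string) ([]string, error) {
	if strings.Count(name, ".") >= ndots {
		if addrs, err := net.LookupHost(name); err == nil {
			return addrs, nil
		}
		// Upstream failed (e.g. timed out): fall through to the search list.
		// This is the step that handed us the wildcard CoreDNS answer.
	}
	for _, domain := range search {
		if addrs, err := net.LookupHost(name + "." + domain); err == nil {
			return addrs, nil
		}
	}
	return nil, fmt.Errorf("no answer for %q", name)
}

func main() {
	addrs, err := resolveGlibcStyle("etcd-0.example.internal", 5,
		[]string{"svc.cluster.local"}) // placeholder name and search domain
	fmt.Println(addrs, err)
}
```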
With musl libc, if the DNS name already has enough dots, it won't even fall back to appending the search domains. But glibc, or in our case the Go resolver, first tries the resolution with the actual DNS name, and even if the name has enough dots, if that resolution fails, say the upstream DNS times out, it appends the search domains and retries. That behavior is what exposed us: we had enough dots in the original DNS name, but for some reason the upstream DNS failed, the library fell back and appended svc.cluster.local, and our CoreDNS answered based on its wildcard configuration and gave the gateway IP address. That totally messed up our member table, and all five members started crashing. Luckily for us, we hit this issue in pre-prod, so we got away with it, and we then adjusted our search domains so that we don't face it again.

We have also filed an issue against etcd. The way the other members perceive this is that they are getting a connection from an unknown cluster member ID. The ideal expectation is that when an unknown member connects to you, you just ignore the connection and get on with it. But here they log a message saying it's an unknown member and crash. So we have raised a ticket against etcd so folks can look at it, and we will try to help with it as well.

I'll move on to the other issues; they are much lighter, so we can relax, I guess. We have peer TLS enabled everywhere, both for client communication and for peer-to-peer communication. Every time a peer tries to communicate with another peer, the receiver ensures the certificate is valid. It does a few things: it does a DNS resolution of all the entries in the subject alternative name, and it also does a reverse lookup of the source IP, basically a DNS PTR request, to ensure the result is found among the subject alternative name entries (sketched below). This works fine in a flat environment where all the peers are in the same subnet, but in our case the legacy and new architectures communicate through a gateway, so the source IP is never going to match: when the legacy architecture receives the request, the source IP is always the gateway's IP. The only way out for us was to disable this validation. Initially we had concerns that this might be a security issue, but our VCNs have security list rules such that only these two VCNs can communicate, so we were okay with opening this up, and we disabled the SAN validation during the migration.

Another learning we had: etcd aggressively does DNS resolution of the certificate entries. Our peer TLS certificates had entries for both the legacy and the new environments, and we did not remove the legacy ones. Once the migration is done, the legacy environment's DNS names are invalid, but every time there is a connection from a peer, etcd still tries to resolve the DNS entries of the peer members. That overloaded our VCN DNS; the DNS team noticed a spike in queries coming from our tenancy, and we fixed it by making the appropriate changes. The important takeaway is to keep the certificate free of cruft and remove whichever entries are irrelevant to that environment.
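The peer check I described does roughly the following. This is an illustrative sketch of the logic, not etcd's actual code: behind a gateway, the PTR lookup of the source IP yields the gateway's name, which is not among the peer certificate's SANs, so the check fails even for a legitimate peer.

```go
package main

import (
	"crypto/x509"
	"fmt"
	"net"
	"strings"
)

// checkPeerSource cross-checks the TLS-level identity against the network
// source: reverse-resolve the source IP and require that some PTR name
// matches a DNS SAN in the presented certificate.
func checkPeerSource(remoteAddr string, cert *x509.Certificate) error {
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return err
	}
	names, err := net.LookupAddr(host) // PTR lookup of the source IP
	if err != nil {
		return err
	}
	for _, n := range names {
		if cert.VerifyHostname(strings.TrimSuffix(n, ".")) == nil {
			return nil // source identity matches a certificate SAN
		}
	}
	// With an SNI gateway in the path, host is always the gateway's IP,
	// so this never matches the real peer's name; hence we had to disable
	// the validation, with VCN security lists as the compensating control.
	return fmt.Errorf("source %s matches no certificate SAN", host)
}

func main() {
	fmt.Println(checkPeerSource("10.0.0.5:33210", &x509.Certificate{}))
}
```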
I spoke about learner mode, where, when you add a new member, you add it as a learner, wait for it to catch up with the data, and then promote it to a regular member. Since we were not using learner mode, because we wanted to keep support for the older versions, we were relying on the health check. But the health check is not meant for this purpose. Here is what it does: first it ensures there are no etcd alarms, like the no-space alarm or the corrupt alarm; it ensures the cluster has a leader; and finally it does a quorum read. You might be aware there are two kinds of reads in etcd: linearizable reads and serializable reads. A linearizable read is as good as a write; it goes through a quorum of members and only then acknowledges and returns the result. By default the health check does a quorum, that is, linearizable, read, but you have the option to disable that and make it a serializable read, where it just reads from that member and responds.

Ideally a quorum read should have been sufficient, but we were not fully convinced, because there could be issues with the older versions of etcd. So the way we went about it: I spoke earlier about the Raft commit index and applied index. You can query the commit index and applied index from etcd; I can ask a member what its current commit index and applied index are. Using that, we ensure a new member has actually caught up with the existing members by comparing the applied index, and only then proceed with the next member. That way we are quite certain the migration for that member is done.

These are minor issues, but I'll still mention them. Unlike some other key-value stores or databases, etcd expects you to have enumerated all the etcd members when you bring up a new member, and if there is any mismatch, the bring-up fails right at the start. Ideally I would have expected to be able to talk to any one member and get the updated list, but if the list provided as a command-line parameter when bringing up etcd is wrong, the bring-up itself fails. Initially our orchestrator was populating this in the etcd pod manifest. We run the pods in a headless mode: on the control plane we push the etcd pod manifest, and the kubelet picks it up and runs the pods. So we were initially populating the manifest with the members, but if something happens in the legacy environment, say pods get recreated, there is a mismatch and the pod doesn't come up. So we moved this enumeration into the startup entrypoint of the etcd pod itself. We didn't want to build a special etcd image for the migration, because new images get rolled out and we didn't want to get caught out by that. So we used init containers that run first and push the configuration to a scratch volume, and we made etcd pick up that entrypoint. We leveraged the existing etcd image, with an additional init container pushing the extra configuration files for initializing the member. A sketch of that entrypoint logic, together with the applied-index gate I just described, follows.
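Two pieces of that, sketched together with the Go client under assumptions (placeholder endpoints, a fixed lag threshold); this is the shape of the logic, not our actual entrypoint: build --initial-cluster by asking a live peer, and gate progress on the new member's applied index catching up.

```go
package migration

import (
	"context"
	"fmt"
	"strings"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// buildInitialCluster asks any healthy peer for the live membership and
// renders the --initial-cluster value, instead of baking a static list
// into the pod manifest that can go stale.
func buildInitialCluster(ctx context.Context, cli *clientv3.Client) (string, error) {
	resp, err := cli.MemberList(ctx)
	if err != nil {
		return "", err
	}
	var parts []string
	for _, m := range resp.Members {
		for _, u := range m.PeerURLs {
			parts = append(parts, fmt.Sprintf("%s=%s", m.Name, u))
		}
	}
	return strings.Join(parts, ","), nil
}

// caughtUp gates progress as described above: compare the new member's
// applied index against the leader's raft index, allowing a small lag.
func caughtUp(ctx context.Context, cli *clientv3.Client, leaderEp, newEp string, maxLag uint64) (bool, error) {
	lead, err := cli.Status(ctx, leaderEp)
	if err != nil {
		return false, err
	}
	member, err := cli.Status(ctx, newEp)
	if err != nil {
		return false, err
	}
	return lead.RaftIndex-member.RaftAppliedIndex <= maxLag, nil
}
```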
One last issue; I hope I have time to finish this. When you add a new member, a few things are enforced: first, adding the new member must not break quorum, and second, all the existing members must be healthy. The first restriction, ensuring quorum is not lost, makes sense, because we don't want to break an existing cluster while adding a new member. But the second restriction bit us: say we are doing a migration and bringing up a new member in the new environment, but something goes bad in the legacy environment and an operator jumps in and tries to add a member there. That member addition fails, because etcd sees two member additions being attempted and refuses to support it. That hurt, and there are ways to disable it, but unfortunately the flag provided (the strict reconfiguration check) disables both of these protections, the quorum-loss prevention as well as the restriction on concurrent member additions. So we decided to live with it, because we don't want to lose quorum; if something goes wrong, the operator has to jump in and manually remove one member to unblock the migration.

One good thing to call out: the etcd SIG team does a good job of backporting many features to the older patch releases, all the way back to 3.3. That really helped us, because when we started out, the only thing we needed to ensure was that for each minor release we were on the right patch release.

This concludes my presentation. I have two minutes, so if there are any questions about etcd or the migration process, I can try to answer. Thanks.

Q: Good talk. A couple of things. The whole DNS search-domain fiasco, I've seen this as well, and over time we started just putting dots on the end of hostnames to avoid search domains exploding the number of queries in the first place. But my question is actually about the persistent volumes and the choice to use persistent backing for etcd. It sounds like when you are doing the migrations, those persistent volumes aren't coming with you anyway. Is that correct?

A: In our legacy architecture we did not use block volumes, because we were packing too many etcd pods onto one node, and you can't attach, for example, 40 block volumes to a compute instance; it depends on the shape of the compute instance. So there we relied on scratch storage for the backend data, but our backup operator was backing up the data with an RPO of 15 minutes. So in the legacy environment the data was on scratch storage, but in the new environment we didn't want that, and a 15-minute RPO is too much, so we decided to assign a block volume to each etcd member. That's why, during the migration, we had to create the block volumes.

Q: That's still along the lines of what I was wondering: the performance difference and the I/O latency you'd get from using local NVMe versus persisting everything to a network disk. Was the trade-off of that, versus just relying on etcd's HA to recover if you lose the data, worthwhile?

A: Right. Like I said, we monitor the latency of the block volumes, and the block volume service provides I/O throttling signals, so if a control plane is being bombarded and there is I/O throttling, we get alarmed and proactively assign additional IOPS to the block volume. That's how we handle this currently.
Q: Yeah, I think it always feels like you can hit the same amount of IOPS as with locally attached storage, but on I/O latency there's still a pretty big gap that's hard to overcome.

A: I think I'm over time, so thanks for joining the talk and honoring me with your presence. Thanks a lot.