Hello everyone! Today we will talk about Astacus, a cloud storage backup and restore tool for Apache Cassandra. This talk is being recorded for Cassandra Summit 2023. We will discuss Astacus's architecture, walk through the backup and restore procedures, and finally restore a small sample cluster as a demonstration.

A few words about myself. I've been in software development for more than 10 years, working mainly in C++ across various application areas such as healthcare, IPTV, messaging, and email. For the last couple of years I've been working as a senior software engineer at Aiven, and for a change I decided to explore the non-relational world, so I've been working on various Apache Cassandra related automation tools, including tools for disaster recovery such as Astacus.

Why would we want an automated tool for disaster recovery? Well, at Aiven we manage many different clusters for many different customers. They come in various sizes with various workloads, and we deploy those clusters on top of public cloud infrastructure such as Google Cloud, Amazon, or Azure. Making all of this work requires a lot of automation. We also care a lot about security, so we want additional protection for the data we upload to cloud storage. And since we want to scale, we also want to keep track of our object storage costs: we don't want to waste object storage space, since it costs us money.

So we came up with Astacus as the solution for Apache Cassandra and several other databases. Astacus hides backup and restore complexity behind a simple REST-like API and manages common concerns such as snapshotting, uploading, downloading, and backup retention for Apache Cassandra and some other databases. Data compression and data encryption can be configured in Astacus, and that's what we usually do. The software is written in Python. A single Astacus process is deployed on each Cassandra node, and the Astacus processes interact with each other and coordinate during backup and restore operations. Finally, Astacus is open source, developed under the Apache license.

But there is already an open source tool that does backup and restore for Cassandra, called Medusa. Later in this talk we will do a brief comparison between the two tools, so you can decide which one fits your use case best.

Let's dig into the Astacus architecture. From a bird's eye perspective, Astacus is a collection of nodes, the same number of nodes as Cassandra nodes, that interacts with the outside world via the coordinator API. This is the higher-level REST-like API of Astacus, and it includes operations such as backup, restore, and cleanup, which basically performs backup retention. To implement those higher-level operations, the nodes talk to each other via a lower-level API. It is also REST-like, so the communication happens via HTTP. The coordinator node that accepted the higher-level backup or restore request talks to the other nodes and instructs them to snapshot the database, upload the snapshot to cloud storage, or download a snapshot from cloud storage if it's a restore we're coordinating. Any node in the Astacus cluster can become a coordinator, and that coordinator node is responsible for the interaction with the outside world: a human operator invoking Astacus via curl or the Astacus client, or some management plane software instructing the cluster to back up or restore, not unlike how any Apache Cassandra node can coordinate read and write requests.

Finally, each Astacus node needs some lower-level access to the database itself. For Cassandra, we need to be able to start and stop the Cassandra process, invoke some nodetool commands, namely snapshot, and we also need access to the file system so we can grab that snapshot and upload it later.

All Astacus operations are asynchronous. The coordinator API method returns almost immediately and gives you a status URL that you can then query to get the status of the running operation. Also, an Astacus cluster accepts only a single operation at a time: you probably don't want two backups running in parallel from different nodes. Each coordinator node, before actually invoking an operation, attempts to lock the cluster, and only if that succeeds does it proceed with the operation.
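To make that interaction pattern concrete, here is a minimal sketch of what a client of the coordinator API could look like in Python. The endpoint path, port, and response field names here are assumptions for illustration only; the real Astacus API is documented in its repository.

```python
# A minimal sketch of driving an Astacus coordinator: kick off an
# asynchronous operation, then poll the status URL it returns.
# Endpoint paths, the port, and response field names are assumptions
# for illustration, not the documented Astacus API.
import time
import requests

COORDINATOR = "http://localhost:5515"  # hypothetical Astacus port

def run_operation(op: str) -> None:
    # The coordinator locks the cluster and returns almost immediately.
    response = requests.post(f"{COORDINATOR}/{op}")
    response.raise_for_status()
    status_url = response.json()["status_url"]  # assumed field name

    # Poll the status URL until the operation finishes.
    while True:
        state = requests.get(status_url).json().get("state")
        if state in ("done", "fail"):
            print(f"{op} finished with state: {state}")
            break
        time.sleep(5)

run_operation("backup")
```

This is essentially what the Astacus client does for you behind the scenes; as mentioned later in the demo, plain curl against the same endpoints works just as well.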
So how does backup work? First we capture the cluster schema. Then we instruct Cassandra to create snapshots for all the keyspaces except the system ones. We upload those snapshots to object storage, and as the last step the coordinator node compiles all the information about this particular backup into a backup manifest, which is a piece of metadata stored separately in the object storage. When that succeeds, the backup is deemed complete and the status URL reports success.

Backup manifests reference the data items, such as SSTable data files, SSTable index files, and whatever other files are present in the Cassandra snapshot. Since Cassandra SSTables are immutable, we can store only a single copy of any SSTable component and reuse it between manifests. For example, in this picture the SSTable nb-2 is shared between the second and the third manifest, because it existed in the snapshot both at the time manifest 2 was created and at the time manifest 3 was created.

A few words about backup retention. It's also a cluster-level operation: it involves iterating over the backup manifests present in the object storage and releasing all the manifests that no longer fit within the retention policy. We make sure that we keep at least one backup no matter what, and there is also a configurable maximum number of backups per cluster. As the final step of the cleanup, the SSTable data files, index files, and any other objects stored during those backups that are no longer referenced by any manifest are also removed from object storage, to save on space, because we don't need them anymore. There's a small sketch of this bookkeeping at the end of this section.

Restoration is a little more complicated. First, we need to make sure that the cluster we're restoring onto has enough nodes to actually bring the data in. Astacus allows restoring a backup onto a bigger cluster, but not onto a smaller one. Astacus also takes care of availability zone balance: each node is configured to belong to a certain availability zone, and if the cluster topology at restore time doesn't provide a similar balance, so that the availability zones at backup time cannot be mapped to the availability zones at restore time, the restoration fails. If the topology fits our needs, we restore the schema from the manifest, shut down the Cassandra nodes, download the SSTables from object storage, and distribute them into the data directories. As the very final step, after the cluster has been started up again, we restore some remaining schema items, such as user-defined functions, if there were any.

Since Astacus is reused between different databases, there are common pieces of code that are deliberately database independent: upload a blob to object storage, download a blob from object storage, compile a manifest, check the topology. Those operations are independent of the database type, while some steps during backup and restoration are specific to a certain kind of database; nodetool snapshot, for example, is specific to Cassandra.
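Here is the promised sketch of the manifest and retention bookkeeping, assuming each manifest references immutable SSTable components by a content hexdigest. The data structures and names here are illustrative, not Astacus's actual internals.

```python
# A minimal sketch of manifest-based deduplication and retention.
# Each manifest references immutable blobs (SSTable components) by
# hexdigest; cleanup drops old manifests and then deletes only the
# blobs that no remaining manifest references. Illustrative only.
from dataclasses import dataclass

@dataclass
class Manifest:
    name: str
    hexdigests: set[str]  # blobs this backup references

def plan_cleanup(manifests: list[Manifest],
                 max_backups: int) -> tuple[list[Manifest], set[str]]:
    """Return (manifests to release, blobs safe to delete)."""
    keep_count = max(1, max_backups)      # always keep at least one backup
    kept = manifests[-keep_count:]        # newest backups within the policy
    dropped = manifests[:-keep_count]     # manifests to release
    still_referenced = set().union(*(m.hexdigests for m in kept))
    dropped_refs = set().union(*(m.hexdigests for m in dropped)) if dropped else set()
    return dropped, dropped_refs - still_referenced

# "nb2hash" is shared by both surviving manifests, so it is kept;
# only the blob referenced solely by the dropped manifest is deleted.
m1 = Manifest("manifest-1", {"aa11", "bb22"})
m2 = Manifest("manifest-2", {"bb22", "nb2hash"})
m3 = Manifest("manifest-3", {"nb2hash", "cc33"})
dropped, garbage = plan_cleanup([m1, m2, m3], max_backups=2)
print([m.name for m in dropped], garbage)  # ['manifest-1'] {'aa11'}
```

Note that "bb22" survives even though the manifest that first stored it is gone, which is exactly why only a single copy of each SSTable component ever needs to be uploaded.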
So Astacus is built on a plugin-based architecture. Most of the database-specific logic is centered in the coordinator and the higher-level APIs. Astacus is configured to load the plugin for a certain database, and when a backup is requested, the plugin is asked for the actual list of steps needed to create a backup of this particular type of database. That list can include generic steps, like uploading or downloading a snapshot, as well as database-specific steps, like taking a snapshot for Cassandra. For example, here is the list of backup steps used when creating a backup for Apache Cassandra. It includes some generic steps, like listing hexdigests, uploading blocks, and uploading the manifest, and also some Cassandra-specific steps, like taking and removing the snapshot.
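As a rough illustration of the plugin idea, here is how such a step list could be assembled. The class and method names are invented for this sketch, named after the steps mentioned above; they don't match Astacus's real internals.

```python
# A rough sketch of the plugin architecture: the coordinator asks the
# loaded plugin for an ordered list of steps, mixing generic
# object-storage steps with database-specific ones. Names invented.
from abc import ABC, abstractmethod

class Step(ABC):
    @abstractmethod
    async def run(self, cluster) -> None: ...

class ListHexdigests(Step):           # generic: what already exists in storage
    async def run(self, cluster): ...

class UploadBlocks(Step):             # generic: upload the missing blobs
    async def run(self, cluster): ...

class UploadManifest(Step):           # generic: finalize backup metadata
    async def run(self, cluster): ...

class TakeCassandraSnapshot(Step):    # Cassandra-specific: nodetool snapshot
    async def run(self, cluster): ...

class RemoveCassandraSnapshot(Step):  # Cassandra-specific: clean the snapshot up
    async def run(self, cluster): ...

class CassandraPlugin:
    def get_backup_steps(self) -> list[Step]:
        return [
            TakeCassandraSnapshot(),
            ListHexdigests(),
            UploadBlocks(),
            UploadManifest(),
            RemoveCassandraSnapshot(),
        ]
```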
So, we've discussed Astacus's architecture and some backup and restore specifics. Now let's see those things in action, with a demonstration of backup and restore for a small local cluster.

For the purpose of this demo, I've configured a local Cassandra cluster using CCM. You can see the cluster status right here in the top right corner of the screen. It has three nodes, and all three nodes are up and running. I've also populated the cluster with some data using the cassandra-stress tool, so the table standard1 has some partitions written to it; there's some data for us to back up and then restore. I've configured a single Astacus process per Cassandra node, also running right here on my laptop on localhost.

Let's look at the node configuration in some more detail to illustrate the concepts we talked about before. The coordinator section contains the information the coordinator needs to perform backup and restore, for instance the list of Astacus nodes to talk to, and database-specific information configured in the plugin section. In this case it is Cassandra, since we're backing up and restoring a Cassandra database. The plugin configuration has some Cassandra-specific details, for instance credentials to access the cluster and the topology of the Cassandra cluster. The node-local section contains the information the node needs to perform the backup and restore operations local to this particular node. For example, the nodetool command, the start command, and the stop command all point to the CCM commands that manage our local cluster, and the data root is set up to look at the CCM node1 data root.

For the purpose of this demonstration, I've configured local object storage, but of course Google Cloud, AWS, and Azure are also supported; there are multiple examples in the Astacus GitHub repo on how to configure those. So it will store our backup data in a local directory on my machine. I've set up backup compression, but I haven't set up encryption, just for demonstration purposes: it will be easier to look at what's actually stored if it's not encrypted.

So let's create a backup and see how it looks. The backup should complete rather fast, since we're not uploading data to actual cloud storage. Let's see what we have stored. First, we've stored the backup manifest, which contains metadata about the backup. Let's take a look inside it. Since it's compressed, we will need zstdcat to do that. The backup contains the schema information, that is, the CQL commands needed to recreate the schema. It also contains the node topology information, so we can recreate the same token distribution on restore as we had during backup. And it contains the snapshot results: the list of objects in object storage, referenced by their hash, as well as their relative paths. Let's search for a Cassandra data file to illustrate. Yes, here it is: one of the SSTables, the data component specifically, is supposed to be stored under this hexdigest in the object storage. Since our object storage is local, we can actually locate it in there. And there it is; it looks like a data file indeed. There's a small sketch of checking such a blob against its hexdigest right after the demo.

So now that we know the structure of the backup, let's see how restore works. Since this is a disaster recovery tool, let's emulate a disaster: let's remove the whole cluster. And it's no longer there. No data, no nodes, nothing. To perform a restore, let's reprovision it, but not start it. As we can see in the status, all the nodes are now down and not initialized, so it's empty: just the configuration and the binaries are in place and all the directories created, but no data is actually stored.

So let's actually restore our backup. The restore first powers up the cluster to recreate the schema, then shuts the cluster down again to put the data in place, and then powers it back on and leaves it for us to use. I'm using the Astacus client during this demonstration. What it does behind the scenes is invoke the restore HTTP method and then monitor its status via the status URL provided in the response body. So it could just as well have been curl invoking the restoration process; you don't need the Astacus client to do that. As we can see, the restore is now done; the client will report it shortly. So let's see: the cluster is up and running. Let's connect to it and verify that the data indeed got restored. And this concludes the demonstration of the Astacus backup and restore tool for Apache Cassandra.
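Since the blobs are filed under their hexdigests, we can sanity-check the demo's local object storage with a few lines of Python. This sketch assumes zstd compression and that the digest is SHA-256 of the uncompressed contents; Astacus's actual hashing and compression scheme may differ, and the storage path is hypothetical.

```python
# A minimal sketch of verifying a stored blob against the hexdigest
# it is filed under. Assumptions for illustration: blobs are
# zstd-compressed and the digest is SHA-256 of the original
# (uncompressed) contents. Astacus's actual scheme may differ.
import hashlib
import pathlib
import zstandard  # pip install zstandard

def verify_blob(storage_root: pathlib.Path, hexdigest: str) -> bool:
    blob = (storage_root / hexdigest).read_bytes()
    data = zstandard.ZstdDecompressor().decompress(blob)
    return hashlib.sha256(data).hexdigest() == hexdigest

# Example: check every blob in the local "object storage" directory.
root = pathlib.Path("/tmp/astacus-demo-storage")  # hypothetical path
for path in root.iterdir():
    print(path.name, "ok" if verify_blob(root, path.name) else "MISMATCH")
```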
Now that you've seen Astacus in action, let's do a short comparison between Astacus and Medusa. When it comes to Cassandra backup and restore, Medusa is the more established and more mature tool of the two. It's been around longer than Astacus and was built specifically for the purpose of backing up and restoring Cassandra clusters. While Cassandra support is rather recent in Astacus, the building blocks used to create it are very well battle-tested: we rely on the same code we use when creating and restoring backups for PostgreSQL and MySQL on the Aiven platform, and we've accumulated quite a bit of experience with it over the years. That code exists as a separate library called Rohmu, and we've shared it with the community; you can see the GitHub link in the lower left corner of this slide.

Astacus is operated via a REST-like API, and the inter-node coordination during backups and restores also happens via a REST-like API, basically via HTTP. In Medusa, backups are invoked via a command line interface or via a gRPC API, and the coordination happens by SSHing onto the nodes. Under some circumstances, it might be easier to open a port for HTTP in the firewall rules than to configure full-fledged SSH access between the nodes, for example.

Astacus has a configurable option for client-side encryption. While it does consume some extra CPU cycles, it makes us independent of the object storage provider's encryption facilities: we can store our customers' data securely even on a provider that doesn't offer an encryption option. Medusa, by comparison, includes an option to enable AWS server-side encryption.

At present, Astacus doesn't offer a cross-topology restore option. If you had N nodes in your cluster at backup time, you need at least N nodes at restoration time, and any extra nodes will not be used; after restoration, the cluster will include only N nodes. Medusa provides an sstableloader-based cross-topology restoration option, so you can restore onto a different topology.

So what's next in store for Astacus? Point-in-time recovery and incremental backup and restore: we'll be working on implementing those on top of the existing building blocks and manifest infrastructure. Also, various scalability improvements: we're exploring the option of backing up and restoring really big clusters with Astacus, like thousands of nodes, and the present coordination mechanisms might be too simplistic for that. For example, the requirement of having an up-to-date list of nodes in the coordinator configuration might become a challenge at that scale. And of course, various smaller improvements to configuration, logging, observability, stability, et cetera. Astacus has been shaped by the requirements of the Aiven platform, so a lot of things that should be configurable still remain hard-coded, for example the location of the data directory within the Cassandra root. We'll work on making those tunable and on making Astacus more community-friendly.

And that's it. That's all I wanted to share with you today. Feedback and pull requests are always welcome; never hesitate to contact us via GitHub or other means. And thank you for listening to this talk.