Welcome to a CNCF webinar on application snapshots using consistency groups. I'm George Castro, community manager at VMware and a cloud native ambassador, and I'll be moderating today's webinar. We would like to welcome our guest presenter today, Ravi Aluboyna, senior architect at robin.io. A few housekeeping items before we get started. During the webinar you're not able to talk as an attendee; there's a Q&A box at the bottom of your screen in the Zoom UI, so please feel free to drop your questions in there and we'll get to as many as we can at the end of the webinar. This is an official webinar of the CNCF and, as such, is subject to the CNCF Code of Conduct. Please do not add anything to the chat or questions that would be in violation of that Code of Conduct; basically, please be respectful of all of your fellow participants and presenters. With that, I'll hand it over to Ravi to kick off today's presentation.

Hi, this is Ravi Kumar Aluboyna. I work for robin.io, where I'm a senior architect. Robin is an application orchestration platform built on top of Kubernetes to run data-heavy applications that fall into the SQL, NoSQL, and big data segments. Robin is a platform to run Oracle RAC, Postgres, MySQL, MongoDB, Cassandra, Cloudera, or Kafka; any data-heavy application can run on the Robin platform. With that brief introduction to Robin, let's get into our agenda.

I'm going to talk a little bit about the Kubernetes landscape, and especially the application landscape right now on 1.16: what kinds of applications are suitable for running on Kubernetes. Then we'll get into a more detailed discussion of databases and their IO patterns, and the challenge in running and snapshotting databases, or any data-heavy application, on Kubernetes. After that, we'll talk about the core construct that enables us to run these applications on Kubernetes, which is consistency groups, and we'll have a brief Q&A session at the end.

With that, let's talk about Kubernetes and the application landscape. Kubernetes has become the de facto standard for running stateless applications; it has pushed aside every other orchestrator out there. If you want to run an Nginx farm, the default option today is Kubernetes. All of these applications, be it Node.js or Java or Nginx or WordPress, roughly qualify as web applications. The second segment of applications is the databases, which could be Oracle, MySQL, Postgres, or MariaDB; these are traditional SQL databases. The next segment is the distributed data stores, which could be document stores like MongoDB, or key-value-pair stores like Cassandra, or InfluxDB, Prometheus, Redis. All of these are distributed in nature; these are modern-day distributed data stores. And to the far right are our heavy hitters, the big data applications, which could be Hadoop stacks or Elasticsearch or Splunk or big OLAP stacks, analytical processing stacks. That's our big data segment.

So this is a classification from the application side. But if you're building a platform, if you're a Kubernetes orchestrator, there is a different division: we call the first segment stateless applications, and we call all of the rest stateful applications. What we are going to dive into is deploying and orchestrating stateful applications, and especially taking snapshots of them. So with that classification, let's get a little bit deeper into databases, that is, SQL databases and NoSQL databases.
Let's inspect what it is that makes deploying and managing databases so challenging. Let's go briefly over these two segments, SQL databases and NoSQL databases. Some of the attributes of SQL databases: they are primarily ACID compliant, and they are monolithic architectures, mainly because they were built decades ago, when bare metal was the predominant platform these applications were deployed on. They are row-oriented databases, their primary workload is transactional, and they have a standardized SQL interface; it's a very, very well-known kind of data store. The NoSQL databases, on the other hand, have made some compromises to address certain workloads. There is tunable consistency, there are tunable durability guarantees, and these are scale-out architectures built to leverage commodity hardware. And like I said, they target different workloads: some of them target key-value pairs, some of them are document data stores. They serve both OLAP and OLTP, both transactional load and analytical load. And they have non-standard client interfaces; some of them do give out SQL-like semantics, but through non-standard client interfaces like CQL or a MongoDB client or a Redis client. So this is a very high-level overview of the distinction between SQL and NoSQL databases, and it is important for understanding and designing the primitives that enable us to run these applications on Kubernetes.

Let's talk about the standard database deployment models. This is our Postgres application; assume that it is running in a container, since we are talking about Kubernetes here. What is required for running Postgres on Kubernetes? Obviously you'll need a data volume, which is provisioned through CSI and comes from a storage stack. Let's call it the data volume. The question now is: is this enough? In a test environment, maybe yes; in a production environment, no. Because the recommendation for Postgres, or for that matter any database, is that you put the write-ahead log, or the redo logs and undo logs, on a different disk. So naturally we'll go and ask the Kubernetes CSI driver to provision another volume. This forms our Postgres application: a container and two volumes. When we talk about applications like Cassandra, which are distributed in nature and fall into the other segment, the NoSQL databases, the same pattern applies. It's not enough to just provision a data volume: we have a commit log here, which mandates a different spindle, a different volume. These are the recommendations for performance; if you want better performance, you follow these recommendations of having two volumes. And on top of that, Cassandra is a little more complicated because it's a distributed application, so we have more than one pod running, and each pod consumes more than one volume with different IO patterns. Of course, there is also a consistent hashing scheme for Cassandra. So what we have established here is that for a standard SQL database, it's not just a container and storage; there is more to it. You really have to understand what the application is doing and allocate storage accordingly: there is a data volume and there is a write-ahead log. When it comes to Cassandra, it's even more complicated because there are multiple containers in play and multiple volumes in play.
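To make the "one container, two volumes" point concrete, here is a minimal sketch, assuming the standard Kubernetes Python client, of how the two claims for a Postgres pod might be requested. The namespace, claim names, sizes, and storage class names are hypothetical placeholders, and this is not Robin's actual provisioning path; it only illustrates that the data files and the write-ahead log get separate volumes.

```python
# Minimal sketch (not Robin's code): request two separate PersistentVolumeClaims
# for a Postgres pod -- one for data files, one for the write-ahead log.
from kubernetes import client, config

def request_pvc(api, namespace, name, size, storage_class):
    body = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }
    return api.create_namespaced_persistent_volume_claim(namespace, body)

config.load_kube_config()
core = client.CoreV1Api()

# Data and WAL deliberately go to different volumes (ideally backed by
# different disks), following the database's own deployment recommendation.
# "databases", "pg-data", "pg-wal", and "fast-ssd" are made-up names.
request_pvc(core, "databases", "pg-data", "100Gi", "fast-ssd")
request_pvc(core, "databases", "pg-wal", "20Gi", "fast-ssd")
```

For a distributed store like Cassandra, the same request would simply be repeated per pod, which is exactly why the volume count grows so quickly.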
With this understanding, let's look at the actual transaction flow: what happens when a user issues a SQL statement? Assume that this is the data volume, where the actual final data resides, which is one volume, and this is the WAL, the write-ahead log volume. These are two separate volumes coming from, let's say, two different disks. Now say the user issues a SQL statement. A SQL statement could be autocommit, or it could be a series of statements with a start transaction and a commit or rollback at the end; a transaction start, a bunch of statements, and a commit make one unit. Let's say there's a start transaction. What usually happens in the majority of databases is that a start-transaction entry is written to the WAL file. Then let's say there were some data changes, an insert query or an update query: a block gets added. Whatever change set is applied through the SQL statement is written into the transaction WAL. Say, for example, there is an update on a key X and we are changing the value to 20. That X = 20 is written to the WAL, not exactly in that format, but roughly; the idea is to capture the changes of the transaction. We make more changes to the database and then we do a commit, and the commit is written as a marker entry in the WAL. So what we have now in the WAL is a start transaction, a set of changes, and a commit. At some point, database processes come in and do some sort of compaction; different terms are used in different database architectures, but essentially what it means is that we take all the changes and apply them to the actual data blocks. We are taking the IOs from one volume, compacting them, and writing them to a different volume; that is where the value X = 20 is applied to a data block. So this is a standard flow in a SQL database.

In a NoSQL database it's slightly more subtle, and there is an asterisk on the word transaction, because some NoSQL or distributed databases do not offer true transaction semantics; they offer pseudo-transaction semantics. But the same idea applies: there's a transaction, the primary records it and emits a commit log, the commit log entries are transferred to the secondaries, the replicas, the replicas register these changes, and when there's a commit, a flush happens. It's almost the same behavior.

So we have seen that the changes are applied to the transaction log first, and then periodically they get flushed to the data blocks. Why is that? Fundamentally, some design assumptions were made, either for performance or for durability and persistence. What are those design assumptions? On the infrastructure side, the assumption is that the transaction log and the data files go to separate disks. Why? Because there are two writes involved, and we don't want those two writes going to the same spindle; that would create an IO blender effect, where you spin the disk head back and forth in multiple directions. We don't want to spin the disk back and forth; we just want the writes streaming. So that is the WAL. It is also possible that there are multiple transaction logs, and we want all of those transaction logs to reside on different disks as well. The WAL is an append-only workload, designed to leverage the sequential access of a spindle. And crash consistency is assumed and designed into databases: at any point in time, if you abruptly shut down the server on which a database is running, it will come back up online. That is assumed in the design, and it is one of the reasons why we have redo and undo logs in databases.
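Before moving on to the distributed-database assumptions, here is the write-ahead-log flow above compressed into a toy sketch. It is purely illustrative, not any real database's code; the file names and record format are made up, but the ordering is the point: the change and the commit marker hit the WAL first, and only a later checkpoint moves them into the data file and recycles the log.

```python
# Toy illustration of the WAL-then-checkpoint flow described above.
import json, os

WAL_PATH, DATA_PATH = "wal.log", "data.json"

def wal_append(record):
    # The change hits the write-ahead log (and the disk) before anything else.
    with open(WAL_PATH, "a") as wal:
        wal.write(json.dumps(record) + "\n")
        wal.flush()
        os.fsync(wal.fileno())

def commit_transaction(changes):
    wal_append({"op": "begin"})
    for key, value in changes.items():
        wal_append({"op": "set", "key": key, "value": value})
    wal_append({"op": "commit"})          # the commit marker makes the transaction durable

def checkpoint():
    # Periodic flush/compaction: apply WAL entries to the data file, then recycle the WAL.
    data = {}
    if os.path.exists(DATA_PATH):
        with open(DATA_PATH) as f:
            data = json.load(f)
    if os.path.exists(WAL_PATH):
        with open(WAL_PATH) as wal:
            for line in wal:
                rec = json.loads(line)
                if rec["op"] == "set":
                    data[rec["key"]] = rec["value"]
        with open(DATA_PATH, "w") as f:
            json.dump(data, f)
        os.remove(WAL_PATH)               # circular-buffer effect: the old log entries are gone

commit_transaction({"X": 20})             # X = 20 is durable, but only in the WAL so far
checkpoint()                              # only after this does it reach the data file
```

Notice that between the commit and the checkpoint, the only durable copy of X = 20 lives in the WAL; that window is exactly what makes naive per-volume snapshots dangerous later on.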
NoSQL databases, which are distributed databases, additionally assume that the partitions, those independent individual containers, go on different physical hardware. So these are some of the design assumptions on the infrastructure side. Now, as a database designer, I take that contract from the infrastructure and then do some IO optimizations. There is the WAL, the write-ahead log, which is an append-only workload; we call it the undo log or the redo log in different database schemes. It is a persistent circular buffer, and there could be one or more such buffers residing on disk. There is dirty page management, the in-core structures versus the on-disk structures. There is autovacuuming; different terms are used in different database technologies. There are double-write optimizations in MySQL, and there are parallel xlog writes, which are transaction log writes. What am I getting at? We talked about the assumptions made on the infrastructure side and about some IO-path optimizations on the database side, and we are barely scratching the surface of those IO-path optimizations here. My goal is to make you understand what the infrastructure requirement is just for this transaction log. The point is that, in order to design a data protection strategy for a data-heavy application like a database, we really have to understand how these databases have been designed: what design assumptions were made by these applications, and what recommendations are made by the application vendors.

Now that we know a little bit about these assumptions, that your WAL and data volume have to reside on different disks, and that for a distributed database the containers have to be provisioned on different hardware, let's design a data protection strategy for this application. Let's take our standard Postgres database, which could equally be MySQL, Oracle, or Oracle RAC. Say we want to use standard volume snapshotting; by that I mean we could be using SAN LUNs for the data volume and the transaction log, or we could be using plain LVM-based volumes for data and transaction log. In either case, it is a volume that is provided to the database, and all of these volume management options, be it SAN or LVM, have some sort of snapshotting capability. Let's say we were to use those built-in snapshotting capabilities and snapshot these volumes, the data volume and the WAL. The first hour we take a snapshot, that is, we create a snapshot of these two volumes; the second hour we create one, and the third hour we create another.

What is wrong with this approach? Or rather than what is wrong, how are we taking these snapshots? With a loop. One of the simplest strategies for taking a snapshot when multiple volumes are involved is to write a plain loop, nothing but three lines: for each volume, take a snapshot. What's wrong with this? Nothing so far, right? And when it comes to NoSQL databases, it's again the same loop, just a loop within a loop: for each container, snapshot each volume in that container. If it were as simple as this, the world would be at peace, right? There wouldn't be any problem. Unfortunately, it's not as simple as this; it is actually much, much more complicated. And what is the problem? The looping is the problem, and we'll see why; the sketch below shows the kind of loop we're talking about.
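A minimal sketch of that "plain loop" strategy, with a hypothetical snapshot_volume() helper standing in for whatever SAN, LVM, or CSI snapshot call you would really make; the names are made up. The thing to notice is that nothing here coordinates the volumes with each other.

```python
# The naive per-volume snapshot loop; snapshot_volume() is a stand-in for a
# real SAN / LVM / CSI snapshot call, not an actual API.
def snapshot_database(volumes, snapshot_volume):
    # SQL database: one container, a handful of volumes.
    for volume in volumes:                    # data volume, WAL volume, ...
        snapshot_volume(volume)               # each snapshot is taken independently

def snapshot_distributed_database(containers, snapshot_volume):
    # NoSQL / distributed database: the same loop, just nested per container.
    for container in containers:              # e.g. 50 Cassandra partitions
        for volume in container["volumes"]:   # e.g. data + commit log in each
            snapshot_volume(volume)           # ~100 independent, uncoordinated snapshots
```

Each call lands at whatever moment the scheduler gives the loop, which is the crux of the problem described next.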
So we have these two database volumes, our data volume and our transaction log. Let's see what the loop is doing to these volumes. Let's run through it. At time t0, there is an orchestrator, some process, that is executing this loop, and at t0 that process initiated a snapshot of the transaction log. So we got our snapshot of the transaction log. But it is a process, so it is subject to scheduling delays. Let's say our process, the one running this loop, got swapped out. After some time, the process is rescheduled and it triggers the snapshot of the data volume, so we get the second snapshot, the snapshot of the data volume. Depending on how the loop is implemented, it could happen the other way around as well: we take the snapshot of the data volume, the process gets scheduled out, and then we go and take the snapshot of the write-ahead log. Anything is possible. I'm giving you the example of a standard database, but imagine a Cassandra application with 50 partitions, which will translate to maybe 100 volumes; if you follow the strategy of "for each container, for each volume, take a snapshot," you will be getting snapshots of these volumes at different points in time. You could argue that the delay is in milliseconds, but it is inevitable that the snapshots of these individual volumes land at different points in time. The big question now is: are they consistent? Let's keep that question in mind and go see what the problem is.

So this is our timeline. Somebody runs a transaction: there's a start transaction, they issue a bunch of updates, and there's a commit statement. Right after issuing the commit, this is how our write-ahead log looks and this is how the data blocks look: X = 20 is written to the transaction log and a commit entry is written to the transaction log. Nothing has been written to the data blocks so far, because that is done by a periodic thread; it depends on a cadence, or a trigger in the database logic, that will come in, take the entries in the transaction log, compact them, and write them to the data blocks. That hasn't happened yet in our time sequence. Now let's say we start the snapshot at this point. How are we taking the snapshot? Our for loop is at work here. The loop comes in and tries to take a snapshot; note that it's a loop. Let's assume it picked up the data volume and took a snapshot. That's the snapshot of our data volume. Now our snapshotting loop gets scheduled out, and while it is out, the database flush is at work: the database process comes in and compacts the entries. That database flush did a transaction flush, meaning it read all the entries from the transaction log and applied them to the data blocks. If you look at the picture now, X = 20 is in the data volume, and it is no longer present in the transaction log, because the transaction log, the WAL files, are persistent circular buffers, which means the blocks are reused; the flush has released those entries from the transaction log. Now our snapshot orchestrator comes back in and tries to finish the snapshot, which means the loop gets rescheduled. In the loop, the next volume is the transaction log, and we take a snapshot of the transaction log. The picture that you see on the far right is our final snapshot.
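That whole timeline can be compressed into a toy re-enactment, using in-memory lists and dicts as stand-ins for the two volumes; this is purely illustrative, not real database or snapshot code.

```python
# Replaying the race described above: snapshot of the data volume, the loop
# gets scheduled out, the database flush runs, and only then does the loop
# snapshot the (now recycled) WAL.
wal, data = [], {}                        # in-memory stand-ins for the two volumes

def db_commit(key, value):
    wal.extend([("begin",), ("set", key, value), ("commit",)])

def db_flush():                           # the periodic compaction thread
    for rec in wal:
        if rec[0] == "set":
            data[rec[1]] = rec[2]
    wal.clear()                           # WAL blocks are reused after the flush

db_commit("X", 20)                        # commit returns: X = 20 is durable... in the WAL
snap_data = dict(data)                    # loop iteration 1: snapshot data volume -> {}
db_flush()                                # loop gets scheduled out; the flush races in
snap_wal = list(wal)                      # loop iteration 2: snapshot WAL volume -> []
print(snap_data, snap_wal)                # {} []  -- the committed X = 20 is in neither snapshot
```

Running it prints two empty snapshots: the committed value exists in neither captured volume.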
So what's wrong with this snapshot? The transaction log in this snapshot has no entries. Where is X = 20 in the snapshot? It's gone. That is the problem with our snapshot. If you were to restore from this snapshot, this is the picture you would get, and you can clearly see the difference. We ran a snapshot, and the snapshot was triggered after we issued a commit; with the ACID guarantees of the database, X = 20 should be present in the snapshot. Unfortunately, we don't have it. There is a race between the for loop and the DB flush: two processes are racing here, and that race causes an invalid snapshot to be created. This is a big, big problem, and this is where consistency groups come in. The core reason this happens is that we are treating these two volumes as two different entities and taking snapshots of the individual volumes. This is a real problem in database snapshots; it is actually a data loss scenario.

What are consistency groups? Consistency groups are groups of volumes that function as a unit for the application, which here is our transaction log and our database file. This is the group on which the lifecycle operations work, be it snapshotting, cloning, or extending volumes. It is a group that maintains the write order: you write to your transaction log first and then the writes make it to the data volume, and that write order is maintained. And it is a group that preserves the crash consistency semantics, because there is an assumption that the transaction log is written first and the writes are then applied to the data volumes. We maintain crash consistency: at any point in time, if the database were to go down, we know for sure that the data is in the transaction log, and it can be undone or redone based on whether the transaction is committed or not, and whether we follow a redirect-on-write approach or a copy-on-write approach.

Now that we know about consistency groups, let's design a better data protection strategy. Let's combine these volumes and call them a consistency group. We take a snapshot again on the first hour and we get a consistent snapshot of these two volumes, and the same goes for the repeated snapshots. This solves the problem of consistent snapshots. But is there anything more we can do? We can do more, but before that, what is the problem in the current Kubernetes CSI stack? It only operates at the single-volume level. The volume consistency group concept is being worked on, but right now there are only single-volume-level interfaces.
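To make the consistency-group idea concrete, here is a minimal sketch of the primitive, assuming hypothetical per-volume freeze_io/snapshot/thaw_io calls; a real storage stack such as Robin's fences writes inside its own layer rather than through calls like these, so treat this as an illustration of the semantics, not an implementation.

```python
# Minimal sketch: the group, not the individual volume, is the unit of snapshotting.
class ConsistencyGroup:
    def __init__(self, name, volumes):
        self.name = name
        self.volumes = volumes            # e.g. [wal_volume, data_volume]

    def snapshot(self, snap_name):
        # 1. Fence IO on every member so no database flush can race the snapshots.
        for vol in self.volumes:
            vol.freeze_io()
        try:
            # 2. Snapshot all members at the same, now frozen, point in time,
            #    preserving the WAL-before-data write order.
            return [vol.snapshot(f"{self.name}-{snap_name}") for vol in self.volumes]
        finally:
            # 3. Resume IO whether or not the snapshots succeeded.
            for vol in self.volumes:
                vol.thaw_io()

# pg_group = ConsistencyGroup("postgres", [wal_volume, data_volume])
# pg_group.snapshot("hour-1")   # one crash-consistent image of both volumes
```

The point of the try/finally shape is that the freeze window is short and always released, while every member volume is captured at a single point in time.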
I said there is something more we can do. Before that, I want to explain something else: a database is not just a data file, a data volume, and a transaction volume. It has configuration as well; it could be database configuration or it could be metadata settings. Most databases store this in metadata tables, but some databases, or some applications I should say, like Cassandra, store it in configuration files. So there is not just data and WAL; there is also configuration. How do you capture that configuration as part of the snapshot? The better approach is to actually snapshot everything, the volumes as well as the container, and when we say the container, we also capture its topology. If the Cassandra cluster has 50 partitions, we capture the fact that there are 50 partitions in this snapshot; if after the snapshot you were to scale out to 100 partitions and then roll back to that snapshot, you would get 50 partitions back. So we capture the configuration, the topology, and the data. That is a much, much better approach to snapshotting than just snapshotting volumes. The concept is that we want to protect the entire application, not just the volumes.

So Robin provides primitives on top of applications, at the application level. You can take a snapshot of the entire application, and we protect data, config, and metadata, so you can roll back to any snapshot at any point in time. We can push that snapshot, which is a consistent point-in-time image, to a remote target using the backup command, and we can restore from that backup at any point in time.

Let me go over a brief architecture overview of Robin. We have built on top of Kubernetes; we have not changed Kubernetes. We have taken a CNCF-certified Kubernetes deployment and built a distributed storage stack with all the enterprise-grade features like snapshots, clones, QoS, replication, and backup. We also have a pluggable networking stack: what we realized is that for applications like Oracle we need container identity persistence, which means the IP address has to be retained. So we have storage, we have compute in terms of Kubernetes, and we have networking, and we built a workflow manager that orchestrates the snapshots and clones, is responsible for the consistency groups, backups, et cetera. With this package, we can manage the end-to-end lifecycle of these applications. Robin is a software-only platform that can be deployed in any cloud, AWS, Azure, Google Cloud Platform, or it can be deployed on a bunch of VMs, be they ESX or OpenStack VMs, or on bare metal. And like I said, we can run on any cloud, any VM; we can also run as a storage-only provider on top of any of the managed Kubernetes engines running in the cloud. With that, you can run any of these distributed applications, and we allow for the lifecycle management of those applications.

We have customer deployments of Robin software allowing customers to run huge data-heavy application clusters. We have a deployment where 11 billion security events are getting ingested into an Elasticsearch, Logstash, and Kibana stack. There is a big multi-Hadoop deployment running on a single Robin cluster with two Cloudera data lakes and two Cloudera compute-only clusters with Kafka and Druid. And there is a deployment of 400 Oracle RAC databases running on a Robin cluster today. If you're interested in product demos, please go to robin.io and check out how we do snapshots and rollbacks, how we create clones of these applications out of snapshots, and how we back up to the cloud and instantiate an application in the cloud. You can get a free trial from get.robin.io, and please also subscribe to our Slack channel, slack.robin.io. That is Robin: supercharge Kubernetes to deliver big data and databases as a service.

Awesome, thanks, Ravi, for the great presentation. We now have some time for questions. If you have a question that you would like to ask, just drop it in the Q&A tab at the bottom of your screen and we'll get to as many as we have time for. We have one question so far, from Peter, who asks: hello, is Robin ready for Oracle Cloud Infrastructure?
So we are working towards it, but as of today it's not; it's not certified yet.

Okay, and that's the single question we have so far. I'd like to leave it open here for a minute or so in case people have follow-up questions.

While the questions are coming in, I'd like to show a quick demo of Robin. Like I said, Robin is a software platform; you can install it on a bunch of VMs or on hardware. Once you install it, you get a dashboard like this, an App Store experience, where you have all of these data-heavy applications: Cassandra, Cloudera, Couchbase, Kafka, Hortonworks, Oracle, and a few others. All of these applications, 25 to 30 of them, are available out of the box as bundles, which means they show up on your dashboard and you can click and deploy them. And I have some applications deployed here: Oracle RAC, Cassandra, MySQL, and MongoDB. Let's go to Oracle RAC. Oracle RAC is an active-active system, which means there are two containers in play here. And if you can see this, there are lots of volumes here, the same thing I was trying to explain: there are multiple data volumes, there are multiple redo volumes, there's flash, there's grid. Although this is a small deployment in terms of size, I want to show you the scale in terms of the number of volumes that needs to be managed. If you have an application like this and you want to take a snapshot, imagine the pain of using SAN snapshots or LVM-based snapshots. With Robin, the entire application can be deployed as well as managed. By that I mean you can just say snapshot: you can create a one-time snapshot, and if you noticed, there is an option to create an application-consistent snapshot. What databases give out of the box is a crash-consistent snapshot; we also allow you to plug in hooks where you can freeze the application, if you choose to, and then take the snapshot. So now we have an entire application snapshot available to us. Once you have a snapshot, you can create a clone or restore to a point in time, and the entire application will be rolled back to that point in time. There is also an option to configure scheduled snapshots; you can run hourly snapshots or daily snapshots with retention periods. The same flow works for MongoDB or Cassandra; it's exactly the same thing. Here I have a MongoDB application, a distributed MongoDB deployment, so there are a lot more containers in play and a lot more volumes in play. And if I want to snapshot this MongoDB cluster, I can just say snapshot. So that is the power Robin brings to the table: you can not only deploy these applications, but you can also create application-consistent snapshots of them.
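As an aside on the application-consistent option mentioned in the demo, here is a rough sketch of what a freeze/thaw hook pair could look like, assuming a Postgres data directory mounted at /pgdata; the commands and mountpoint are illustrative assumptions, not Robin's actual hook interface.

```python
# Rough sketch of a pre/post snapshot hook pair; NOT Robin's hook API.
import subprocess
from contextlib import contextmanager

@contextmanager
def frozen_postgres(mountpoint="/pgdata"):
    # Pre-snapshot hook: flush dirty pages, then fence the filesystem so the
    # snapshot captures an application-consistent image.
    subprocess.run(["psql", "-c", "CHECKPOINT;"], check=True)
    subprocess.run(["fsfreeze", "--freeze", mountpoint], check=True)
    try:
        yield
    finally:
        # Post-snapshot hook: always unfreeze, even if the snapshot failed.
        subprocess.run(["fsfreeze", "--unfreeze", mountpoint], check=True)

# with frozen_postgres():
#     take_consistency_group_snapshot()   # hypothetical snapshot call
```

The general pattern is the same whatever the database: quiesce, snapshot the consistency group, release.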
George? Yeah, I'm here, I'm here. If you have some more time, I can show a few more details here. Yeah, absolutely. We don't have any questions, so unless anyone has more questions, we can just keep it with you.

Yeah. So we talked about snapshots, but let me also show you some other interesting things in Robin. You get a dashboard, and let's say you want to deploy Cloudera here. You can choose which roles you want to deploy; say I want Kudu here. Kudu is Cloudera's OLTP platform, the transactional platform; it's not built on top of the rest of the stack, it's a different engine. What I want to show here is that you can set the sizing for your compute, which means cores and memory. You can also define the storage. You can say I want replicated copies spread across racks, and I can set the block size, the compression, the layout of the data on the disk, and encryption. You can also define the workload type. These settings are most useful when you're defining something like a DataNode. Say I need three DataNodes in a Hadoop cluster; this is my compute capacity, and here I'll probably say the Hadoop volumes are throughput-intensive. Yes, local and throughput-intensive. And I know that Hadoop does three-way replication. This is a very small deployment, so let's say I have three instances here. Here is where an interesting thing comes in: we prevent placing more than one DataNode container on the same node. What this will do is place the three containers on three different physical servers. And we can also enforce that the storage and compute for a DataNode come from the same node. These two policies are a must-have for any distributed application that is doing its own replication, and there are many of them: there's Hadoop, there's Cassandra, there's MongoDB, there's CouchDB, Influx. Many applications do their own replication. So these are the granular placement policies that Robin offers. With that, if there are any more questions, we can take them up.

We do not have any more questions. That was great. Last chance for questions, everybody. All right, great. Thanks, Ravi, for the presentation. Thanks for joining us today, everyone. This webinar recording and slides will be available online later today, and we look forward to seeing you at a future CNCF webinar. Thank you, everybody, and have a great day. Thank you. Thank you, everyone.