So hello, everybody. This is Sage Weil, and he will present a talk about Ceph data services. Thank you. Good morning, everyone. Thank you for coming so early in the morning. My name is Sage Weil. I work at Red Hat on the Ceph project. I've been working on Ceph for almost 15 years now. And today I'm going to talk about some of the multi-cluster, multi-cloud, hybrid cloud capabilities in the Ceph storage system. So just briefly, what I'm going to go through: I'm going to give the briefest of introductions to Ceph just for a little bit of background. I'm going to talk about what I mean by data services and what the business problems are that I'm looking at and trying to solve. And then we'll go through block storage, file storage, and then object storage. We'll talk a little bit about edge scenarios and then try to wrap it up, talk a little bit about the future, and build a coherent picture. So Ceph is a unified storage platform that provides an object storage API, a block storage service, and a distributed file system. It's all built upon RADOS, the Reliable Autonomic Distributed Object Store, the part that replicates your data, places it across lots of different nodes, handles failures and recovery, provides availability, and all that good stuff. All three of these services are built on top of that underlying storage infrastructure. Ceph is an open source upstream project, obviously. We release every nine months. Every release is named after a species of cephalopod. Nautilus is the next release that's coming out later this month, at the end of February. And the one after that is going to be Octopus, about nine months later. So the features and things I'm going to be talking about today are mostly going to be a mix of what exists in Nautilus and previously.
And I'm also going to be talking about things that we're planning and going to be building after that. So for the Ceph project, we like to talk about having four key priorities. The first of those is usability and management. Ceph has grown up over many years, built by system administrators, for system administrators. And as a result, it has developed a reputation for being hard to use and complicated and difficult and so on. So one of our key focuses over the past two years has really been to simplify the system so that people can understand it without being a Ceph expert. A lot of effort goes into that. If you want to hear more about some of the things we're doing, you should attend the software-defined storage dev room. There's a whole slew of Ceph talks this afternoon. Performance is also key. When we built Ceph, everything was hard drives, and so the software models you choose and so on are a little bit different. Today, new hardware is moving to high-performance flash, and so there's a lot of work going into optimizing Ceph so that it will go much, much faster. And then the last two: we're making sure that Ceph will work and live well in a container environment and a container world, so making sure that Ceph works well with Kubernetes is a big priority. And then the last one, which I'm mainly going to be focusing on today, is making Ceph work in a multi-cloud and hybrid cloud world. And I'll talk a bit about why in a second. So first, I want to give a little bit of background about what I mean by data services and why we are here today. So the first point is that the future is cloudy. IT organizations today tend to have multiple private data centers and infrastructure spread across lots of data centers. They also tend to consume multiple public clouds and have different applications and different data sets in all these different places. And it's only getting worse. Even the infrastructure that's on-premise is starting to use private cloud services.
And so it's really just a cluster of different clouds, some of them private, some of them public. And application developers are increasingly building their applications by consuming self-service infrastructure resources, whether it's compute or storage. And if it's not already true today, the next generation of applications that people are building are going to be built specifically on top of these services. So it's going to be a cloud-native world. And for all the talk about stateless microservices and container orchestration and so on, they're great. But in reality, in the real world, applications have lots of state, lots of data that they depend on and store. And state is hard to move. It's hard to move in a safe and coherent way, so that you don't disrupt your application. As anybody who's had to do some sort of data migration by hand has experienced, this is a tricky problem. So when we talk about data services, we're talking about three key areas. The first is data placement and portability. I have a data set, where should I put it? Once I store it in one particular place, can I move it somewhere else? Maybe it starts out on-premise and I want to move it to the cloud later, or the other way around. And can I do that in a seamless way without interrupting the applications that are consuming that data? The second piece of this is introspection. I'm a large organization. I have infrastructure spread across multiple private and public clouds. I'm storing 20 petabytes of data. What is it? Who stored it? What type of data is it? Can I delete it? How much is it costing me? Those sorts of things. And then the third area is around policy-driven management of that data. Can I automate the process of deleting old data after I've kept it for, say, two years for compliance reasons? Can I ensure that this particular data set stays in the EU so that it complies with certain laws and legal restrictions?
Can I optimize where this particular data set is placed based on the cost and the performance that I need over time? And can I build automation and policy around all of those processes so that I don't have to do it manually? So data services tend to encompass these three key areas. For the purposes of this talk, I'm mostly going to be focusing on the first area: where do I put my data, and can I move it around? It's also important to realize that data services are about more than just the data and the storage system. Because the reality is that it's not just that you're storing data; it's that the data is being consumed by some application that's modifying it or presenting it to the world or doing something useful with it. So if you're going to move the data to another data center or to another cluster or on or off premise, you also have to move the application that's consuming it. And you have to make sure that that migration is done in coordination with whatever the processes are around moving that application, to make sure it all works. And for this reason, we believe that container platforms are key. Because if you are going to do that sort of migration, you have to have this relationship between the application migration and the storage migration and coordinate that whole process. And so you're going to hear about Kubernetes off and on throughout this talk. And that's why: it's sort of the first opportunity where we have a coherent stack that's managing both your applications and your storage, and some possibility of automating this process. So we're going to break this down into five data use scenarios, five underlying business problems that organizations are trying to solve with their data. The first is simply multi-tier. You have different kinds of data with different performance and storage capacity requirements, and you want to have different tiers of storage where you can put your data.
And it might vary over time. It might be that when you first write the data, it needs to be high performance; as it ages, you can move it to a slower tier for archive for some period of time and then delete it. So having different tiers of storage is key. The second is mobility: the ability to move an application and its data between sites, presumably with minimal or no downtime or availability interruption. And this might be that you're taking an entire site and moving it, or it might be that you're moving just a piece of one application in a site and moving it to another. The third is disaster recovery. If you have your infrastructure spread across multiple clouds or multiple data centers, you want to be able to tolerate an entire site going down and then re-instantiate the data in another site for business continuity, so you don't go out of business. And often in this case, you want a point-in-time consistent view of that data when you restart your whole infrastructure, so that it's as if you crashed and you can pick up where you left off. Maybe you can tolerate the loss of the last few minutes of transactions, or a few seconds, or maybe you need a coherent copy. The fourth case is what I'll call stretch. It's a similar situation where you want to tolerate a site outage, but you want to do it without any loss of availability. So in the disaster recovery case, you lose the data center, you panic, and you restart everything in a secondary data center. And maybe you lost a little bit, but you didn't go out of business. In the stretch case, you don't even notice. Your application doesn't go down. And then the final scenario is edge. You might have a bunch of central data centers, and then you have a whole bunch of satellite data centers or mobile sites or something like that. And they're all generating data or consuming data, and you want all of this to be managed and coherent.
So these are the five business cases that we'll be referencing throughout the talk. And ideally, we want to solve all these problems in order to have the nirvana of storage. And finally, the last preliminary point: a lot of this is going to come down to whether you're doing synchronous replication or asynchronous replication. In a synchronous replication model, your application does a write, it's written to all of the replicas of that data, and only after they're all persisted does the application say, OK, I'm done with my write. In an asynchronous model, you issue a write, it might write to some of the copies, and the application says it's done. And then in the background, later over time, copies are made to additional replicas. So with synchronous replication, you tend to have higher write latency, because you're waiting for all the replicas to write and complete. But you have a consistent model, because all the replicas are always in sync, presumably, and your application can go on. And usually when we talk about synchronous replication, we're talking about a single Ceph cluster, because internally the Ceph cluster is doing synchronous replication. And in the async case, your application might write to one cluster, and then another cluster might have an asynchronous copy. If you have a failure of the first cluster and you have to pick up where you left off, you might have stale data. So whether you do synchronous or asynchronous replication depends on the needs of your application, the latency, and so on. All right, so we're going to talk about block storage, and then file, and then object. When we talk about block storage, we're talking about a virtual disk device. A virtual disk is generally used exclusively by one application, because usually you're layering a local file system on top, and it assumes it's the only one consuming that disk and writing to it. So block devices have to provide strong consistency.
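As a rough illustration, the two acknowledgement models can be reduced to a toy Python sketch (this is illustrative only, not Ceph code; the class and function names are made up):

```python
# Toy model of synchronous vs. asynchronous replication ack semantics.

class Replica:
    def __init__(self):
        self.data = {}

def sync_write(replicas, key, value):
    # Synchronous: persist to every replica before acking the client.
    for r in replicas:
        r.data[key] = value
    return "ack"  # the client sees the write only after all copies exist

class AsyncPair:
    # Asynchronous: ack after the local copy; the remote trails behind.
    def __init__(self):
        self.local = Replica()
        self.remote = Replica()
        self.backlog = []

    def write(self, key, value):
        self.local.data[key] = value
        self.backlog.append((key, value))  # shipped later by a mirror daemon
        return "ack"                       # remote may be stale at this point

    def flush(self):
        # What the background mirroring process eventually does.
        for key, value in self.backlog:
            self.remote.data[key] = value
        self.backlog.clear()
```

With sync_write, the ack only comes back once every replica has the data; with AsyncPair, the ack is immediate and the remote copy lags until flush runs, which is exactly the window where a site failure can lose the last few writes. With that distinction in hand, back to block devices.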
Block devices are usually performance sensitive; you need low-latency reads and writes. They usually have snapshot features around them. And these days, they're usually self-provisioned: you're usually asking Cinder for a block device, or you're asking Kubernetes for a persistent volume. Or maybe you're emailing your IT admin and asking for a LUN. But hopefully, hopefully not the latter. So it's a relatively simple model. What can we do? Well, today, and actually from the beginning, RBD has always supported multiple tiers. Within a single Ceph cluster, you can create different RADOS pools of storage that are mapped to different storage devices, or maybe a specific set of devices in a particular rack or data center. And you can store an RBD image in one of those pools. So you might have a pool that's high performance, backed by SSDs. You might have another pool that's lower performance, with hard drives; maybe it's even erasure-coded. And so you get these multiple tiers of storage depending on the quality of service that you want. So that's great, we sort of cover the multi-tier case. In Nautilus, we also have a new feature called RBD live migration that allows you to migrate an in-use image. So for example, if you have a VM that's running, consuming a block device, you can do a live migration of that RBD image from one tier of storage to another without interrupting or restarting the VM. You've always been able to live migrate the VM to another machine consuming the same storage, but you haven't been able to move the storage to a new performance tier while the VM is still running. This adds that capability, and it's new in the Nautilus release that's out this month. So that gives you multi-tier, and it gives you some mobility within the same data center, or within the same Ceph cluster at least. So what happens if you want to stretch across sites? Well, you can also stretch a Ceph cluster across multiple data centers, two or three data centers.
And in that case, you simply deploy your Ceph daemons across the two sites. You have some big, fat link between them. You set up your CRUSH rules so that your replicas are placed appropriately, so you have some replicas in one data center and some replicas in another. Make sure you have enough monitors. And lots of people do this; it works. Your application can now move between data centers because the storage is available in both: you have a single Ceph cluster available in both sites. The data doesn't move, because it's already in both places. But it's important to remember that the performance is going to be different, because you have a wide-area network link between the two sites. Ceph is doing synchronous replication, which means every write has to cross that link. So this may or may not fit the use case. But it does solve the disaster recovery use case: if you lose one data center, assuming you set up your CRUSH rules properly, your cluster will continue to operate, accessible from the other site. And you can also combine these things. You might have a single Ceph cluster across multiple sites, with one RADOS pool that's stretched across them with a particular CRUSH rule, but also RADOS pools that are confined to each data center. So you have multiple tiers of service. It might be that some VMs only need local storage and don't need that multi-site redundancy; others do. And so depending on what tier or what type of service you need, you can give that to your applications. And if you combine all of these things, you can even leverage this capability for migration. You might have an application that starts out using the local RADOS pool in one data center. You live migrate that to the stretch pool. Then maybe you even live migrate the virtual machine to the other data center, or restart it, whatever it is. And then you can live migrate the image to the RADOS pool in the other data center.
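To make the placement idea concrete, here's a toy Python sketch in the spirit of a CRUSH rule that keeps one replica in each data center. The real CRUSH algorithm is a hierarchical pseudo-random descent over the cluster map; this only captures the flavor of the policy, and the names are invented:

```python
import hashlib

# Toy placement: deterministically pick one host per data center for each
# object, so a whole-site failure still leaves a complete copy elsewhere.
# NOT the real CRUSH algorithm, just the shape of the placement policy.

def place(object_name, hosts_per_dc):
    acting_set = []
    for dc, hosts in sorted(hosts_per_dc.items()):
        # Hash object name + site so the choice is stable but well spread.
        digest = hashlib.sha256(f"{object_name}/{dc}".encode()).digest()
        acting_set.append((dc, hosts[digest[0] % len(hosts)]))
    return acting_set
```

Every object maps to the same hosts on every call, so clients and OSDs agree on placement without any central lookup table, and each data center always holds a full replica.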
So if you combine all of these things using a stretch cluster, you can get the full breadth of mobility and disaster recovery and edge sites and so on. So it sounds pretty good. But the thing to keep in mind is that doing stretch clusters is a little bit sketchy. It isn't necessarily what you want to do. The first thing to keep in mind is that the network latency between those sites is critical. You need low latency for performance, because within a RADOS pool we're doing synchronous replication. You also need high bandwidth, because if you have a failure situation, you're going to have a lot of data flowing across that link in order to do rebuilds and recovery. And that might be more than you expect, and you need to be able to sustain it while you're still doing your normal application reads and writes. It's also important to keep in mind that building these big stretch Ceph clusters is relatively inflexible. It might work for two or three data centers; it's not clear that you want to build a stretch cluster across 20 data centers. Nor is it even likely that you're going to have 20 data centers with low-latency links that are close enough together that you would want to do this in the first place. And you can't take two existing Ceph clusters that you already had and decide later that you want to join them together to get this capability. You can't take two Ceph clusters and smush them together later; you have to plan ahead of time in order to build this sort of configuration. And finally, deploying a stretch cluster means you have a high degree of coupling. You have all of the storage infrastructure across multiple data centers and multiple failure domains depending on a single instance of Ceph, a single piece of software that's responsible for everything.
So if something goes wrong with Ceph, if there's a bug or something, it's going to affect all your sites and not just one of your sites. So you definitely want to use caution if you're taking a stretch cluster approach. So what can we do instead? RBD also has an async mirroring capability. Instead, you would have an existing Ceph cluster with a normal RADOS pool and a normal RBD image, doing synchronous replication within that site. And then on top of that, we layer an asynchronous mirroring capability that lets you mirror that image to a second Ceph cluster, or to another pool in another site. The way it does this is that on the primary image, we maintain a write journal: a log of the sequence of writes that are happening to that device. And then there are mirror daemons that send all those writes over to the other cluster. And so you end up with a disaster recovery backup copy of the image in another site. There's a little bit of performance overhead on the primary because we're maintaining this write journal. In practice, people mitigate this by putting that write journal in a separate high-performance SSD pool. So if you weren't already doing that, it might actually be a performance win; it sort of depends. But you'll have more infrastructure to support it. Honestly, I actually can't remember if we implemented this or not, but at least in principle, you could configure a delay for that replication. So if you wanted a five-minute-old copy, in case you accidentally fat-finger and destroy your primary copy, you could configure that as well. This RBD mirroring feature has been supported since Luminous, and since then we've been improving the tooling and automation around it. And then if you have a failure on your primary site, you lose your cluster A. The whole site goes down, goes off the network, or Ceph explodes, or something like that. Then you have your backup image in the secondary site.
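The write-journal mechanism can be sketched in a few lines of Python. This is purely illustrative, not rbd-mirror's actual implementation:

```python
# Sketch of journal-based async mirroring: the primary appends every write
# to an ordered journal, and a mirror daemon replays that journal on the
# secondary. Replaying any prefix, in order, yields a crash-consistent
# earlier state of the image.

class MirroredImage:
    def __init__(self, size):
        self.primary = bytearray(size)
        self.secondary = bytearray(size)
        self.journal = []        # ordered (offset, data) entries
        self.replayed = 0        # how far the mirror daemon has gotten

    def write(self, offset, data):
        # Applied locally and journaled; acked before mirroring happens.
        self.primary[offset:offset + len(data)] = data
        self.journal.append((offset, bytes(data)))

    def replay(self, entries=None):
        # Mirror daemon: apply journal entries to the secondary, in order.
        stop = len(self.journal) if entries is None else self.replayed + entries
        for offset, data in self.journal[self.replayed:stop]:
            self.secondary[offset:offset + len(data)] = data
        self.replayed = min(stop, len(self.journal))
```

Because the journal is replayed strictly in order, stopping at any point leaves the secondary looking exactly like the primary did at some earlier instant, and after a full replay it is the backup image just described.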
That's point-in-time consistent, because the mirroring preserves the ordering of writes. The image in the second site appears as though there was a crash and you're just rebooting. So it's a point-in-time consistent copy. It's asynchronous, so you might lose the last few writes. But depending on what your latency and bandwidth limitations are on that link, hopefully it's not that much, maybe a few seconds. And then you can restart your application in the secondary site. And the whole lifecycle is considered here. If the primary site comes back up, there's all the code to resynchronize: it might roll back the writes that got lost and then resynchronize with the new writes that happened in site B. You can flip the primary back and forth between the two sites. All the tooling and so on around that exists. So it's a complete, robust solution. The first primary consumer of this is OpenStack Cinder. Over the last several releases, Ocata, Queens, and Rocky, there's been the addition of RBD replication drivers and investment in a lot of the tooling that deploys and configures this. But it's not a complete solution, or at least not a satisfying one. There's a lot of tooling around setting this up and configuring it. But the key thing is that even if you have this all set up in OpenStack and you lose an entire OpenStack site, your storage will come back on the secondary site, but it doesn't set up all your Nova attachments, and it doesn't actually restart your VMs. And I think this really highlights the limitations of how much you can do in the storage or infrastructure layer without the cooperation of the layers above it. It's hard for an infrastructure layer that's just providing VMs and storage to know what to do with your storage, even if you implement all these fancy capabilities for migration or disaster recovery.
What you really need is a complete picture of what your infrastructure looks like, how you define your applications, and how you set up your entire application stack, which is really what you get with Kubernetes, and why we're excited about it. More on that later. So that's what we can do with block: we can do lots of stuff within a single cluster, and we have this RBD mirroring capability across multiple clusters. Ceph also has CephFS, a complete distributed file system. So CephFS is awesome. It's been stable since Kraken. We've had multiple metadata servers for several releases now. We've had snapshots supported since Mimic, which is the previous release. CephFS supports multiple RADOS pools. So within RADOS, again, you can have different pools, backed by SSDs or hard disks, or in different racks and rows and data centers, all that stuff. And in CephFS, you can map a particular subdirectory to any RADOS pool, so the new files created in that directory get stored in that location. CephFS is fast and scalable. It's got quotas, volumes, subvolumes, all this good stuff. And you can provision it with Manila or with Kubernetes persistent volume (PV) drivers. So CephFS is great. It's a little bit different from RBD in that an RBD client talks directly to all the OSDs that are storing the data. So you have a direct data path between the OSDs and the RBD client, either the kernel that's mapping the device or the virtual machine that's consuming the virtual disk. In CephFS, you also have a direct data path for file data. But when you're interacting with the file system namespace, directories, doing readdir, create, rename, open, all that stuff, you're talking to a metadata server. So you have a second set of metadata servers that manage the file system namespace, coordinate access to all the files, and make sure that the clients are cooperating and doing the right thing. So it's a slightly different model.
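That per-directory pool mapping is exposed through a virtual extended attribute, ceph.dir.layout.pool. A minimal Python sketch; the mount path and pool name below are hypothetical:

```python
import os

# Pinning a CephFS directory to a RADOS pool via the ceph.dir.layout.pool
# virtual xattr. The helper is pure so it can be shown without a cluster.

def layout_xattr(pool_name):
    """Build the (name, value) pair for the directory-layout xattr."""
    return ("ceph.dir.layout.pool", pool_name.encode())

def pin_directory(path, pool_name):
    # Requires an actual CephFS mount at `path`; new files created under
    # it will then be stored in `pool_name`.
    name, value = layout_xattr(pool_name)
    os.setxattr(path, name, value)

# Hypothetical usage on a CephFS mount:
# pin_directory("/mnt/cephfs/fast-data", "ssd-pool")
```

Existing files keep their old layout; only files created after the pin land in the new pool.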
So the first question: can we take CephFS and stretch it across multiple data centers the same way we do RBD? And the answer is yes, you can do it, but it has all the same limitations. If you stretch it across data centers, you have to be careful about the latency, the failure domains, all that stuff. The same issues apply, so use caution. And in addition to that, while in RBD you had a direct data path to the OSDs, in CephFS you're also talking to the metadata servers. So you have this additional concern: if the metadata server that's active is in the other data center, then all your metadata access is going to go across the link. And for file workloads in particular, metadata performance tends to be very important, at least for general-purpose file workloads. Maybe for your application it's not such a concern, but you need to pay attention to where your metadata servers are placed. So again, stretch: you can do it, but maybe it's not the panacea that you'd hoped it was. So what can we do beyond this? That's what CephFS will do today; what are the things that we could do with CephFS to improve the multi-cluster capabilities? We're going to talk about a couple of different options. The first thing that we could do, and that we're talking about doing, is something called snap mirroring. The basic idea here is that we already have a robust snapshot capability in CephFS. You can take any directory in the file system and create a snapshot on it, any subdirectory at any time; you don't have to pre-plan your volumes or anything. And those snapshots provide very fine-grained point-in-time consistency. And additionally, CephFS has something called rstats, which is recursive metadata on any subdirectory in the system.
And one of those attributes is called rctime, which is essentially the most recent modification time of anything beneath a particular point in the file system hierarchy. And if you combine these things with something like rsync, something that's just copying files across, then you can build a system where you create a snapshot periodically, say every 10 minutes, of a part of your data set, use that recursive ctime to efficiently identify what has changed in that subdirectory, quickly copy just those changes over to a second cluster, and then create the same snapshot on the second cluster when you're done. And so you could set up a schedule where every 10 minutes you take a snapshot, mirror all the changes across, and then create the same snapshot, and you have this sort of disaster recovery type situation. This matches pretty closely features that exist in many existing enterprise products with similar names. And it's not that hard to do. The main missing piece is that we need a flush operation, so that we know that the rstat value can be trusted before we start synchronizing, because rstats are normally lazily updated; there's actually a pull request in flight to do that. And we need to modify rsync so that it can use that rctime to efficiently identify changed files when it synchronizes. And then you want to build some scripting and tooling around automating the process. But this is something that you could do, and it would match existing enterprise features that people are used to. But it raises the question: do we really need that strong point-in-time consistency in order to do disaster recovery? Well, the easy answer is yes, because you have no idea what's running on top of your file system. It might be a database, it might be something else, and applications consuming the POSIX interface might require that consistency. The reality is that that's not always the case.
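The scan that rctime enables amounts to pruning any subtree whose recursive ctime predates the last snapshot. Here's a toy Python version over an in-memory tree; on a real CephFS mount you would read the ceph.dir.rctime xattr instead of a dict field:

```python
# Toy rctime-driven change detection: skip whole subtrees that haven't been
# touched since the last snapshot, descend only where rctime is newer.
# tree: {"rctime": float, "files": {name: ctime}, "dirs": {name: subtree}}

def changed_paths(tree, last_snapshot_time):
    if tree["rctime"] <= last_snapshot_time:
        return []                      # whole subtree untouched: prune it
    changed = [name for name, ctime in tree["files"].items()
               if ctime > last_snapshot_time]
    for name, sub in tree["dirs"].items():
        changed += [f"{name}/{p}"
                    for p in changed_paths(sub, last_snapshot_time)]
    return changed
```

An rsync taught to trust rctime would walk the same way, descending only into directories whose rctime is newer than the last sync. Note that this still only gives you point-in-time consistency at the snapshot boundaries themselves.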
And in fact, many other distributed file systems have geo-replication features that don't provide strong consistency, and people seem to be totally happy with them. So maybe they provide a consistent copy of a particular file, but they don't provide coherency in the ordering between files that are updated. And it turns out that a lot of applications that are using file storage, and would benefit from having a geo-replication disaster recovery type capability, don't actually need that strong consistency. Maybe they're just storing images, and they don't care if they stored two images in a particular order and only the second one made it to the remote site. And often maybe you're just casually using the file system and you don't care. So if we decide that we don't need that strong consistency, there are other things that we can do. A simple idea would be to have an update log, generated by each of the metadata servers, of all files that are changed, and feed that to a bunch of workers that can just copy those files across to a remote system. And this has some benefits over the snap mirroring model, because you get more timely updates: as soon as you modify a file, it gets copied over to the other side. And it should be able to scale pretty well. The limitation, of course, is that it doesn't give you point-in-time consistency across multiple files in the file system, so it's only suitable for certain applications. And there are some implementation challenges around ordering and making sure that you actually do the mirroring properly, but nothing that's terribly difficult. So we have options. At this point, I'm going to take a quick aside and talk a little bit about what it means to satisfy the migration use case. Consider that you have an application and storage in one data center, and you want to move it to another. How would you actually go about doing this? Independent of file or block or object storage, what's the basic process?
So the simplest model of migrating data is just that you have an application and data in site A. You stop it, you copy all the data, and then you start it up again in site B. That works fine; that's what we've all done in the past when we've had to move things around between sites or servers. Your application doesn't have to be modified; it's the exclusive user of the storage. But there's a long service disruption, because you actually have to stop it while you're doing the migration. So certainly not very satisfying. You can improve on this situation a bit if you pre-stage the data. While your application is still running in site A, you make a full copy of the data, but it's getting a little bit stale, so it's like 99.9% up to date. Then you stop the application, do a final pass on the data to make sure you get those last few changes, and then start it up in site B. So this is an improvement: you shorten the availability gap while you're shutting down the application. And in fact, with RBD async mirroring, you might be able to do this in a pretty efficient way, because you could set up RBD asynchronous mirroring to site B. Let it build up its copy; it's got basically everything, but it's a few seconds old. Pause the application, switch the RBD primary to the other site, and then restart the application. So it's a bit better, but you still have this availability gap. What would be even better is if you could start up the application in the second site while it's still running in the first site, and then seamlessly switch over traffic from one instance of the application to the other. That would be great, but it needs this ability to temporarily be active-active in both sites. So if you can do that, then it's awesome. You have no loss of availability. You have concurrent access to the same data.
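The stop-copy-start flow and its pre-staged variant reduce to something like this toy key-value sketch (illustrative Python, not a real migration tool):

```python
# Toy pre-staged migration: bulk copy while the app is still running, then
# pause the app, copy only the small delta, and restart at the new site.

def migrate(src, dst, writes_during_stage):
    dst.update(src)                  # 1. bulk pre-stage; app still running
    src.update(writes_during_stage)  # ...so the source keeps drifting ahead
    # 2. pause the application; the final pass only moves what changed:
    delta = {k: v for k, v in src.items() if dst.get(k) != v}
    dst.update(delta)
    return delta                     # 3. restart the application against dst
```

The downtime is proportional to the delta, not the whole data set. Shrinking that gap all the way to zero is what the temporarily active-active approach buys you.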
If there's any performance degradation, it's only during that interim period where you're replicating across both sites. And in fact, that's really just a special case of being able to do active-active replication in general; that's the key hard piece of this process. And if you can do that, you can have highly available access to your data in both sites, and it's awesome. But it brings up key questions. Can your application tolerate concurrent access to the same data? How are you replicating the data across those two copies? Are you doing it synchronously or asynchronously? And what's the consistency model? If those two instances are both trying to modify the data, what happens if they collide, or can they collide at all? These are key questions you have to understand about your application's behavior in order to do this sort of thing. So coming back to CephFS: the panacea, the thing that everybody asks for and wants when they hear about a new distributed file system, is, can I just have my file system replicated in multiple data centers and have it active and available in all data centers, where everybody can access it whenever they want, and they get good performance, and everything's consistent, and everything just works? That's what people ask for. And you always sort of shake your head and say, maybe. The reality is that we don't have a general-purpose bi-directional or multi-directional file replication protocol. And the reason is that it's really hard to do that with POSIX. The file system interface in Unix and Linux is extremely complicated, and there are a million opportunities for conflicts. There are conflicts on file data: if you're modifying the file in multiple sites, maybe one of them truncates the file, the other one overwrites part of the file, and then you try to asynchronously mirror those. What do you do, and how do you reconcile those changes?
And there are possibilities for conflict on things like rename. If two different sites each rename one of two directories into the other, it's an impossible conflict to resolve. You have to pick one or the other, and then the other end is confused. So in a general sense, doing asynchronous replication with POSIX leads to these conflicts, and there's no easy way to resolve them. But the reality is that you would never actually do this unless the applications are specifically designed to cooperate in the storage. You don't have arbitrary users doing arbitrary operations in situations where you're trying to have them consume the shared storage. You wouldn't design your application that way, because it would immediately break. So suppose you have applications using the storage that are designed to cooperate at the file layer, such that they carefully avoid these types of conflicts. A good example of this is the Maildir storage format that mail servers use. They have a particular scheme where they write new messages and rename them between directories that is able to tolerate conflicts even when they're using an NFS server. Then that's great. The other classic example would be that you have a content store. You're just writing new files. You're reading existing files. You're not renaming them. You're not doing anything like that. And as long as you're doing a simplified subset of the operations that you can do in POSIX, then you're fine. You can use a shared file system. Your applications aren't going to conflict. You can scale out your application heads. But I think the thing to realize is that as soon as you start to simplify the set of POSIX operations that you use, it starts to sound less like file storage and more like object storage. And so it begs the question of whether that application should really be using the file layer to get this sort of magic multi-directional replication.
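The Maildir-style conflict avoidance mentioned above can be sketched concretely: each writer creates a uniquely named file in `tmp/` and then renames it into `new/`, so files are never modified in place and concurrent writers never collide on a name. This is a simplified illustration; real Maildir names also encode the hostname and a per-process sequence number.

```python
# Simplified sketch of Maildir-style delivery: unique names plus a
# rename-to-publish step make concurrent delivery conflict-free.
import os
import tempfile
import time
import uuid

def maildir_deliver(maildir: str, message: bytes) -> str:
    unique = f"{int(time.time())}.{uuid.uuid4().hex}"   # unique per writer
    tmp_path = os.path.join(maildir, "tmp", unique)
    new_path = os.path.join(maildir, "new", unique)
    with open(tmp_path, "wb") as f:                     # invisible to readers
        f.write(message)
    os.rename(tmp_path, new_path)                       # atomic publish
    return new_path

# Usage: two "sites" delivering into the same shared directory.
root = tempfile.mkdtemp()
for sub in ("tmp", "new", "cur"):
    os.makedirs(os.path.join(root, sub))
p1 = maildir_deliver(root, b"hello from site A")
p2 = maildir_deliver(root, b"hello from site B")
```

Because the only cross-writer operations are "create new unique file" and "rename within the mailbox", this is exactly the kind of restricted POSIX subset that survives shared or replicated storage.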
Maybe you should just be rewriting your application to use object storage directly. So let's talk about object storage and a little bit about why that might make sense. So why object storage? It's based on HTTP, which means it interoperates well with web caches and CDNs and so on, and there are lots of existing libraries. Object APIs give you atomic object replacement, which means that if you put a 10 gigabyte image file to replace an old one, only when you're done and you've completely written the new object does it atomically replace the previous version, so you don't have weird edge conditions there. And the fact that you can't overwrite the middle of an existing object means that the implementations are much simpler. The consistency model is much simpler. You either have the old version or the new version, and so on. The namespace is flat, so it's very easy to scale out when you're actually designing the back-end architecture. And there's no rename. So it's a simplified storage model that lends itself very easily to simplified consistency models, replication models, and implementations. And it's sufficiently powerful that you can do a lot. And I would argue also that there's going to be a lot of object storage in our future. Now, I'm not going to say that file storage is going to go away. In fact, I'm saying it definitely won't go away. File is critical. We have a huge existing set of applications that consume file. And the file interface is really useful. There's a lot of stuff that the POSIX API gives you that's genuinely useful to applications. It's just not for everything. Block isn't going away either. It's sort of the backbone of all the virtual machine and container infrastructures. It's how things boot up. But most of the data that we store in the future, the zettabytes and zettabytes of data that the world is producing, isn't going to go in a block device or a file system.
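The atomic object replacement semantics described above can be emulated on a local filesystem, which is a useful way to see why the model is simple: write the new version out in full to a side file, then swap it in, so a reader sees either the complete old object or the complete new one, never a half-written mix. This is an illustration of the semantics, not RGW internals.

```python
# Sketch of atomic object replacement: stage the full new version,
# then cut over atomically with os.replace().
import os
import tempfile

def put_object(store_dir: str, key: str, data: bytes) -> None:
    fd, tmp = tempfile.mkstemp(dir=store_dir)          # staging file
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())                       # durable before swap
        os.replace(tmp, os.path.join(store_dir, key))  # atomic cut-over
    except BaseException:
        os.unlink(tmp)
        raise

store = tempfile.mkdtemp()
put_object(store, "image.bin", b"old-version")
put_object(store, "image.bin", b"new-version")          # replaces atomically
```

Note there is no partial-overwrite path at all: the only mutation the API offers is "replace the whole object", which is what makes the consistency and replication story so much simpler than POSIX.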
It's going to land in object storage. All those cat pictures, surveillance videos, medical imaging, video telemetry, all that stuff, it's going to be stored in objects. And the next generation of applications that people are designing are going to be storing all of their bulk content in object stores as well. So it behooves us to pay attention to how we're going to solve these more complicated problems around multiple sites and consistency in the object storage world. So the RADOS Gateway, which is Ceph's S3 API, has a rich set of federation and replication multi-cluster capabilities today. The way that RGW views the world of federation is through zones, zone groups, and namespaces. A zone is a set of RADOS pools in one cluster that are storing S3 data, plus a set of RGW daemons that are serving up that data. So what you probably think of as a RADOS Gateway deployment today is just one zone. A zone group is a collection of several zones, possibly spread across a single cluster or multiple clusters, that have some sort of replication relationship. And you can do active-passive, or you can do active-active, so you can put into either zone and it'll replicate in both directions. And then a namespace is a collection of zones that have the same set of buckets and users defined. You can think of Amazon S3 as having a single global namespace. If you create an S3 user, it exists everywhere in the world. A bucket exists in one site, but if you try to read a particular bucket, it's always going to send you to the right zone or the right region to read your data. So it's a global namespace. With RADOS Gateway, you have the same concept, but you can create multiple namespaces if you want, for different organizations or teams or whatever it is. And then one of the key points with RADOS Gateway federation is that failover between sites and zones is driven externally, at least currently.
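The zone / zone group / namespace hierarchy just described can be modeled in a few lines. This is a toy data model to make the nesting concrete, not RGW code, and the names (`gold`, `us-east`, and so on) are invented:

```python
# Toy model of RGW federation concepts: zones grouped into zone groups,
# zone groups grouped into a namespace that shares one set of users
# and buckets. Illustration only.
from dataclasses import dataclass, field

@dataclass
class Zone:                        # one cluster's pools + RGW daemons
    name: str
    cluster: str

@dataclass
class ZoneGroup:                   # zones with a replication relationship
    name: str
    zones: list = field(default_factory=list)

@dataclass
class Namespace:                   # shared buckets and users across groups
    name: str
    zonegroups: list = field(default_factory=list)

    def all_zones(self):
        return [z for zg in self.zonegroups for z in zg.zones]

ns = Namespace("gold", [
    ZoneGroup("us", [Zone("us-east", "cluster-x"),
                     Zone("us-west", "cluster-y")]),
    ZoneGroup("eu", [Zone("eu-central", "cluster-z")]),
])
```

The useful intuition: replication relationships live at the zone-group level, while user and bucket identity is global across the whole namespace.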
So the idea is that it's very easy within a Ceph cluster to tell when something fails and what to do about it, because you're within a single data center. If you have multiple data centers and a data center goes down, it's harder to have this external view of who's down and who's up and where you want to move the master, so RADOS Gateway doesn't try to solve that problem itself. It assumes that something above it, whether it's a human operator or some other automation, is going to decide how to deal with that. So just to view this visually, you might have three different Ceph clusters, each of them with two zones. Some of them might be grouped into replication groups or zone groups. So X-A and X-B might be replicating all the same data. Other zones might be standalone. They're not replicating with anybody else. They're just independent collections of buckets and data. Other ones might be replicating, and so on. And I guess the key limitation with this architecture today is that the initial set of use cases we built this for were operators who are running clusters and dealing with disaster recovery use cases. So the granularity of replication in RGW today is on a per-zone basis. You say this entire zone of buckets is going to be replicated to an entire other zone of buckets, so that if my whole data center goes down, my object storage service is still up. The way it's internally implemented, all the data structures and internal behaviors are written such that we could also do it on a per-bucket basis. But all the tooling and automation and so on is still done on a per-zone basis. So we have some additional work to do so that you can say, I want this single bucket to be replicated to another site and this other one not to be. So that's one of the initial gaps.
But in the current state today, it solves the disaster recovery use cases, and it solves the stretch cases where you want to have the same data set across multiple data centers. You can access the data, read and write it, in both locations, and data will replicate in both directions. And going back to that active-active case I was talking about a minute ago with CephFS, RGW also has an NFS export capability. That was originally built for just copying data into the object store or copying it out, so you can transition applications onto object storage. But if you wanted to, you could export both sites via NFS and read and write to both of them, and you'd have this bi-directional replication. As long as you follow the rules, that is, because POSIX is a rich API and object is a very simple API, so you can't do everything. You can't run an arbitrary application on top of the NFS export. You have to only write complete files, read files, and do no renames and things like that. But you could do it. But you would be better off just making your application directly consume the object storage, because that's really what it's doing. It's using the storage as a bunch of assets that are immutable. Use an API that matches that, and you'll be happier. The way that the RADOS Gateway federation is designed is pluggable. The usual behavior is that a zone replicates data from another zone or other zones in its zone group. But you can plug in different behavior. Back in Luminous, we added a module for Elasticsearch. So you could have a zone that, instead of copying all the data and metadata, would just take all the metadata, dump it into Elasticsearch, and then present a query API. So you could do searches over object names and sizes and owners and so on. You can define indexes over custom metadata that you put on the object so you can do searches. So that's existed for a while.
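The sync-module idea described above (a zone that consumes the replication stream but does something other than store object data) can be sketched like this. Everything here is invented for illustration; it only shows the shape of a metadata-indexing module, not the real RGW plug-in interface or Elasticsearch API:

```python
# Toy sketch of a metadata-search sync module: instead of fetching and
# storing object bodies, it indexes metadata from the replication log
# and answers queries over it. Names are invented.

class MetadataSearchZone:
    def __init__(self):
        self.index = []                       # stands in for Elasticsearch

    def sync(self, log_entry):
        # A normal data zone would fetch and store the object body here;
        # this module keeps only the searchable metadata.
        self.index.append({k: log_entry[k]
                           for k in ("bucket", "key", "size", "owner")})

    def search(self, **criteria):
        return [e for e in self.index
                if all(e.get(k) == v for k, v in criteria.items())]

zone = MetadataSearchZone()
zone.sync({"bucket": "photos", "key": "cat.jpg", "size": 4096,
           "owner": "alice", "data": b"..."})
zone.sync({"bucket": "photos", "key": "dog.jpg", "size": 2048,
           "owner": "bob", "data": b"..."})
```

The design point is that the zone participates in federation like any other peer, so it sees every update, but what it does with each update is entirely up to the module.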
In Mimic, which is the previous Ceph release, we added a Cloud Sync plug-in, which allows you to have a zone that essentially takes data from the other zones and pushes it all into S3, either by taking all the buckets and putting them all into S3, or mapping them all into a single S3 bucket, or maybe a subset of buckets. And it makes a best effort to rewrite the ACLs so that they're still usable in the S3 environment. So that's replicating out to S3 with Cloud Sync. And in Nautilus, we're adding two new ones. One is an archive zone, where you basically define a zone that enables bucket versioning. So we preserve all past copies of an object, and it'll just copy everything and create an archive of all the data that you've ever stored over all time, for archive and backup purposes. And then the other new one is kind of exciting. It's a PubSub zone. It essentially builds an index of all the updates that have happened and then presents a set of APIs so that you can subscribe to events. There was a great demo of this in a talk at KubeCon in Seattle, where a put of an object generates an event that gets fed to Knative, which is a serverless framework for Kubernetes, and triggers a Lambda-style function to go do something. So you can trigger events through the object store using the PubSub stuff. There's also work to feed this into Kafka and AMQ as well. So there are a couple of different models you can use it with. But that's also new in Nautilus. And the way we're thinking about this is not just in terms of Ceph clusters that are backed by bare metal sitting in your own data centers. For example, with the Cloud Sync module, the idea is that you're going to have data stored in lots of different sites locally and also in public clouds. And when you're using public cloud storage, you would presumably have a sort of thin Ceph footprint in that cloud, on EC2 or whatever.
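The PubSub-zone flow just described (a put generates an event, subscribers get notified, and a serverless function fires) can be sketched minimally. This is not the real RGW pubsub API, just the event-fanout pattern it implements; topic names and the callback mechanism here are invented:

```python
# Minimal sketch of the PubSub-zone pattern: object puts generate
# events that are pushed to registered subscribers (which could in
# turn hand off to Knative, Kafka, etc.). Invented names throughout.

class PubSubZone:
    def __init__(self):
        self.subscribers = {}                 # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers.setdefault(topic, []).append(callback)

    def on_put(self, bucket, key):
        # In RGW this record would come from the replication log of the
        # data zones; here we synthesize it directly.
        event = {"type": "OBJECT_CREATE", "bucket": bucket, "key": key}
        for cb in self.subscribers.get(bucket, []):   # one topic per bucket
            cb(event)                          # e.g. trigger a function

received = []
zone = PubSubZone()
zone.subscribe("uploads", received.append)     # stand-in for a serverless hook
zone.on_put("uploads", "report.pdf")
```

In the real system the subscriber would be an HTTP endpoint or a message queue rather than an in-process callback, but the shape (log of updates, subscription API, push on change) is the same.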
That footprint is just managing a gateway role and storing some internal state about what it's replicating locally and so on, what its role is in the federated mesh. The reasoning here is that you're never going to compete with actual S3 by building an S3 service on top of EC2 instances. If you can, it's because Amazon totally screwed up their pricing. They're optimized for efficiency. And so what you really want to do, if you do have a footprint in a public cloud service, is leverage the storage services that they're operating efficiently and optimizing for price and so on. And you just really need that gateway in order to access them. So the way we view the RADOS Gateway federation, you're going to have thick sites on premise that are actually storing lots of data, and then you're going to have thin sites that are just gateways to an external storage service that you're making use of. RGW can also address most of the tiering use cases. Today, you can have multiple RADOS pools within a Ceph cluster that are mapped to different storage types. And when you put an object, you can choose which tier that object is going to go to. Or you can set a policy on the bucket so that new objects written to the bucket will go to a particular tier. And currently, there's a limited ability to set some policy to automatically expire objects. And in Nautilus, there's a new feature in RGW that implements the Amazon S3 lifecycle API. Bucket lifecycle, object lifecycle, I always mix up the term. But it's basically a policy on the bucket that automatically adjusts the tiering. So maybe your objects land in one tier initially, over time they move to another tier, and then eventually they're expired. So that's new in Nautilus, and it will do the tiering between different RADOS pools. In the future, we'd like to extend this.
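The bucket lifecycle feature mentioned above uses the standard S3 lifecycle configuration format, which a client like boto3 would submit via `put_bucket_lifecycle_configuration()`. Here is what the land-in-one-tier, move, then expire policy looks like as a configuration document; the storage class name "COLD" is a placeholder for whatever placement target a given cluster actually defines:

```python
# An S3-format lifecycle policy: transition objects to another tier
# after 30 days, expire them after a year. "COLD" is a placeholder
# storage class, not a name Ceph defines out of the box.
lifecycle = {
    "Rules": [{
        "ID": "tier-then-expire",
        "Filter": {"Prefix": ""},            # apply to every object
        "Status": "Enabled",
        "Transitions": [{"Days": 30, "StorageClass": "COLD"}],  # move tiers
        "Expiration": {"Days": 365},         # then delete after a year
    }]
}
```

Because RGW implements the S3 API, the same document shape drives tiering between RADOS pools that would drive storage-class transitions on AWS.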
So that when you're tiering, you can tier not just within the same cluster to different RADOS pools, but to other Ceph clusters and to external object stores, so that you could have something initially land in your local cluster and then get pushed out to S3, and maybe from there out to Glacier, something like that. That's the plan. So lots to do. Trying to pull together everything I've thrown at you around RADOS Gateway, as far as what we have today and where we're going: today, we view RADOS Gateway as a gateway to a single RADOS cluster, as the name implies. And we have a bunch of geo-replication features tacked on, but it's mostly a gateway to a cluster; that's the way we think about it. In the future, what we'd like to do is shift to a mode of thinking where the RADOS Gateway is really a gateway to a whole mesh, a whole collection of different sites. So when you're talking to the gateway, maybe the data that you're interacting with is local, maybe it's remote, maybe we're proxying your requests, maybe we're placing it somewhere else. But really, it's a federated mesh of different sites. Today, when you talk to RADOS Gateway, if your data isn't stored locally, it'll issue a redirect that bounces you to the right gateway for the other site where your data is located, similar to what you get with AWS S3. And there are some tricks people have done with dynamic DNS to get clients to the right site in the first place. In the future, we might do that redirect, or we might actually do a proxy. So it might be that your application is always talking to one gateway, and it's actually proxying you to remote sites where your data is. So you don't really have to think about where your data is placed or where it's being migrated to, and we're seamlessly moving it around behind the scenes or applying some policy, and you can still get to your data.
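The redirect-versus-proxy distinction just described can be made concrete with a toy request handler. All of this is invented for illustration (the bucket map, zone names, and URLs are not real RGW structures); it only shows the two behaviors a gateway can take when the bucket lives elsewhere:

```python
# Toy sketch of the two gateway behaviors: redirect the client to the
# zone that owns the bucket (today's behavior), or proxy the request
# there transparently (the future direction). Invented names.

BUCKET_LOCATION = {"photos": "us-east", "logs": "eu-central"}
LOCAL_ZONE = "us-east"

def remote_fetch(zone, bucket, key):
    return f"<data for {bucket}/{key} from {zone}>"    # pretend network call

def handle_get(bucket, key, mode="redirect"):
    zone = BUCKET_LOCATION[bucket]
    if zone == LOCAL_ZONE:                             # data is here
        return ("200", remote_fetch(zone, bucket, key))
    if mode == "redirect":                             # today: bounce client
        return ("301", f"https://{zone}.example.com/{bucket}/{key}")
    return ("200", remote_fetch(zone, bucket, key))    # future: proxy it
```

With proxying, the client never learns or cares where the bucket lives, which is what makes transparent migration between sites possible.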
And finally, today RGW replicates at zone granularity to address those disaster recovery use cases. But in the future, we want to make that more flexible, so that on a per-bucket basis you can decide where a bucket is placed, whether it's replicated across multiple sites, or whether you're migrating it between sites, so that you can take one application using a bucket and move it somewhere else, and move the data with it, without having to move the whole site. OK, so the last set of scenarios I wanted to cover is Ceph at the edge, but we're almost out of time, so I'm going to skip the edge sites, even though they're interesting. So why do I keep going on about Kubernetes? I think the key thing to remember is that regardless of what the storage can do, you need to move the applications with it. Truly giving you application mobility is a partnership between moving the application that's consuming the storage and moving the storage itself, which is why we're so interested in Kubernetes and the Rook project, which runs a Ceph cluster inside Kubernetes. That falls into the persistent volume category, which is file and block storage attached to your containers, and increasingly object storage, where we're building the tools to automatically provision buckets in Kubernetes and attach them to applications in the same way you do block and file storage. So lots of stuff going on there, and that's where we see the future. So summarizing, bringing it all together: data services are really about mobility, introspection, and policy. Ceph already has a lot of the key features that give you some of this, but there are also some clear gaps. We haven't solved all the problems yet. I think we're in good shape on the block side. On the object side, we have a lot of things solved, but there's a lot of work to do. And on the file side, there's even more to do.
Our key efforts are really driven around defining what the use cases are in the Kubernetes environment, around RWO and RWX persistent volumes, dynamic bucket provisioning, and so on, and addressing the features in the storage system that we need to make all that work, integrating it with Kubernetes. That's really the way we're approaching it. And that mostly comes down to extending RGW. We're also planning and deciding what multi-cluster features we want to add to CephFS in the future. So, the bottom line: if you had talked to me two years ago about what Ceph is, I would have said it's software-defined storage, a unified platform with file, block, and object, hardware agnostic, really thinking about Ceph in terms of a single instance in a data center. As we move into the future, increasingly we're thinking about Ceph in the context of having lots of clusters across lots of sites, where you really need a platform that lets you manage your data across all of them and migrate data and so forth. So we're increasingly thinking about these multi-cluster, multi-cloud use cases.