Hey, everyone. I'm Joe Arnold from SwiftStack. Timur Alperovich, also from SwiftStack. And we're going to talk about hybrid cloud storage with OpenStack Swift. So we're going to get into the motivation of why we started getting more involved with doing hybrid cloud deployments. Then we're going to do a deep dive into how it works and all the mechanisms, and then go over some of the use cases. So let's get started. So first off, what are people looking to do? We're seeing that people need to store a bunch of data on-premises, but they also want to use some of that data in a public cloud environment. And what's in the public cloud? You have different services that applications can use to process data, whether that's bursting out for compute jobs. There are all sorts of data services in things like media and life sciences used for data processing. A lot of times people are building applications in the public cloud where they may not feel comfortable running that application on-premises with the data center infrastructure that they have. So they want to deploy that application and run it as a service. They want to synchronize some data to distribute it out in the public cloud. Or, a bit more mundane, just have a second site for their data. And so our thinking was: how can we have the data that's in an OpenStack Swift cluster be both on-premises and in the public cloud, and synchronize the two, so that you have the same data in both places? So that's the motivation. Now, from a real practical standpoint, we found that people couldn't move all of their data. Even if they wanted to be in the public cloud, they had a hard time moving it all at once. Lifting something up and shifting 100% of it out into the public cloud was just untenable, because you'd have a lot of data to move, and there's a certain gravity to the data that they have. Moving it all at once was really hard to do. Or you had a situation where the data wasn't generated outside of the system; it was generated by all the equipment, like the list of logos we just showed: the sequencing machines, cameras, scanners, sensor data. All that stuff is being generated on that network, so the natural thing is to store it there. You don't pay the additional transit cost. But the biggest issue was the complexity of trying to change everything all at once. How do you do something in an incremental way? How can you have an application suite that's running on-premises, part of a workflow that's already established and ongoing, but then pick out some part of that that might be compute-intensive and burst it out? So that's the motivation. Then we started looking at what people were doing in order to move the data out. And there are kind of two approaches. One was using tools that would copy the data, like some client tool, but then you're subject to the bandwidth limitations of that individual workstation or client. Or they were taking a gateway, or using some of the NAS products, to move some of that data into the public cloud. But the downside there was you couldn't see the data in the public cloud, because it's not represented in a native object format. So we were thinking, OK, how do we take the data and store it in an Amazon environment or a Google environment in its native format?
So when you actually go look at that bucket, you can see your data, and it's not something else. And what that allows us to do is take those workflows that are on-premises, that see that data, and give them access to the same data as people build more cloud-native applications. And so then what we're going to dive into here is how we go and do that. So what that enables, first, is a second-site archive for the data. Being able to say, hey, here's some set of data that's on-premises, and I want to replicate that data to the public cloud, whether I'm just going to let it sit there as an archive, or I want a second site for content redistribution or for sharing that content with other users. Or the middle option there, which is elastic hybrid cloud compute: taking advantage of workflows that can spin up beyond the compute capacity that you might have on-premises, so bursting that out. Or finally, large data set collaboration. We work with organizations that have multi-terabyte data sets that they want to share. How can we do that without them exposing that from their own network, by synchronizing that data up to a public cloud so they can share it with the other folks they're working with? So when we started building this out, we thought about the different ways people would want to move data between the two environments, and there are a few fundamental profiles that we thought about. One is sync: keeping two namespaces synchronized with each other, either one way or two way, so an update happens in one and it synchronizes to the other. The other is move, where data is put in one location and then moved. Again, this sounds pretty basic, but these are the building blocks that we have to work with: move the data from one site to another. Or, starting to get more sophisticated, is access. How can we have an understanding of both the public cloud and the on-premises data, so we can make a request, and that request says, oh, the data isn't on-premises, let's make that request in the public cloud and fetch the data? Or, if the application is in the public cloud and the data's on-premises, how do we do the reverse? So those are the building blocks that we're going to go through over the next few minutes. Right, so we're just going to dive a little bit into the architectural overview of what we're trying to pitch and build. At a high level, it's what Joe just described. We have applications and users on both sides of the diagram. So we have someone who could be accessing this from Swift on-premises, and we could also have users and applications running against Amazon S3. It could be Google Cloud Storage (those are the two providers we've looked at so far), or it could be something else. One of our goals, as highlighted in this diagram, is that we need to have native objects on both sides. So we're trying to stay away from the existing model of having a gateway, where you have to run this gateway in the public cloud and that's the only way you're going to reach in and get at your data. We want to preserve the native format so that you can leverage EMR or whatever existing workflows exist in Amazon or Google, or whatever else they roll out.
So as part of this, I'm just going to go through, at a high level, how we handle different operations. I'll go through GETs, PUTs, and DELETEs, and we'll talk about some of the challenges there. So when we have an application or users in the cloud trying to request some data, Joe alluded to this notion of access. Some data may be on-premises and not available in the cloud immediately. What we envision is having a SwiftStack access node, as we're calling it, which is a piece of software that's running, in this case, in Amazon. And this is essentially a data router. So when the request comes into this node, it will redirect the request on-premises, realizing that, hey, this object does not exist in S3 right now. Then when the request returns, we'll place the object in S3 and also return it to the caller, and subsequent requests can be made against S3 directly. And this also makes any additional pipelines you run a bit faster, by the way: you're not going to be paying additional latency penalties when processing data that's already been fetched in. Yeah, we wanted to do this in a way so that the data that actually ended up in a public cloud bucket was still accessible and still in the same namespace. So if you're using the S3 API on-premises, you could still use the S3 API in the public cloud, but you wouldn't necessarily always need to go through that access layer; you can actually go directly to S3 itself to get the data. Right, and the thrust of it is that you may not want to forklift all of your on-premises data and place it into the cloud. That's the original problem we're trying to get around, so this allows you to pull in, on demand, some of the assets that you're going to be processing. Then a similar situation arises in the other direction: let's say you've run your pipeline, you have created some assets in S3, and now you're actually trying to pull them down and access them. When that request comes in, we're going to propagate that request to the public cloud storage. Same scenario: the request comes back, we place the object into Swift, so that in subsequent cases we don't have to go to the public cloud anymore. So at a high level, that's our way of handling GETs. They're pretty similar between the public and on-premises implementations.
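To make that routing concrete, here's a minimal sketch of the access-node idea, assuming boto3 for the S3 side and a plain token-authenticated HTTP GET against the on-premises Swift proxy. The bucket name, Swift endpoint, and token are hypothetical stand-ins, not the actual product code.

```python
# Hypothetical sketch of the access-node "data router" for GETs:
# serve from S3 if the object is already there, otherwise fetch it
# from the on-premises Swift cluster and write it back into S3 so
# subsequent requests are served from the public cloud directly.
import boto3
import requests

s3 = boto3.client("s3")
BUCKET = "hybrid-bucket"                                  # hypothetical
SWIFT = "https://swift.example.com/v1/AUTH_acct/photos"   # hypothetical
TOKEN = "AUTH_tk..."                                      # Swift auth token

def handle_get(key):
    try:
        # Fast path: the object already lives natively in S3.
        return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    except s3.exceptions.NoSuchKey:
        # Miss: pull the object from on-premises Swift...
        resp = requests.get("%s/%s" % (SWIFT, key),
                            headers={"X-Auth-Token": TOKEN})
        resp.raise_for_status()
        # ...cache it in S3 in its native format for next time...
        s3.put_object(Bucket=BUCKET, Key=key, Body=resp.content)
        # ...and return it to the caller.
        return resp.content
```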
The more complicated case is when we start talking about PUTs and DELETEs, and we're going to get into some of the overwrite circumstances as well. So in this diagram, we're trying to place an object into Swift on-premises. We're introducing a new notion here; the name we've given this product is Cloud Sync. This is a process that runs on your container nodes, and the whole purpose of this daemon is to asynchronously propagate any changes you're making in Swift into public clouds. So in this case, for example, we're propagating a single PUT into S3 and Google, but we're not doing this inline. These are async processes that are running in the background, catching up to the state of your Swift cluster. As a slight detour, I wanted to dive into how this works, and how this Cloud Sync process operates on Swift. How does it figure out which objects it needs to propagate? The specific thing we're trying to leverage is the Swift container databases. Each container database is a SQLite database, and its contents are represented as a table. The slide has not all the columns that are in the table, but most of them, at least the ones we care about. So for example, as you're placing a bunch of objects into your store, you may also delete some of them. Swift will record in the SQLite database the name of the object and the last-modified date, and it also keeps a flag telling you whether this object has been deleted, essentially a tombstone that it can propagate across all the container nodes. We can leverage this information because each object name is going to be unique in the database. We're not going to have, for example, photo1.jpeg appear multiple times in this container database. And we also have an additional guarantee from Swift that each entry in this database is going to be in chronological order, as far as this node is concerned. So if there are additional updates, we can continue reading after, for example, in this case row ID 44, and we will pick up any new changes in this particular container. So let's walk through how the Cloud Sync daemon leverages this. For example, we place a new photo into our Swift cluster. We have this database that currently has rows 42 and 43; we have some photos in there, it's great. The new photo shows up, it's added to the container database as row 44. The Cloud Sync daemon makes a request to the container database asking, hey, I've got everything up to row 43, what am I missing? Are there any updates after this? At this point, row 44 is returned to the Cloud Sync daemon, it knows that this is a new object it needs to go propagate, and the Cloud Sync daemon will place it into, in this case, Google Cloud Storage; it could be Amazon or another S3-compatible cloud. And now similarly, if you make a PUT in a public cloud, we want to walk through what happens on that side and how we can propagate those updates back into Swift. Obviously we can't reach into Amazon and ask them for any sort of equivalent of a container database that we'd be crawling; they probably wouldn't be super amenable to that, same with Google. But luckily they have primitives that we can leverage to perform similar tricks. So there isn't a need to have any bit of infrastructure in the public cloud in order to do the synchronization from the public cloud back to the on-premises environment. Right, so importantly, we're not going to be adding additional software that needs to be running in Amazon or in Google; we're leveraging only the primitives that are already available out of the box in all of these clouds, which you're free to use today. I think the other non-intuitive thing about this is that it means the on-premises storage doesn't need to expose a public route to itself, because we're making requests to the publicly available services in the public cloud. Yeah, so you still have this nice separation where your Swift cluster does not need to be exposed to any sort of internet access. The entirety of this model works as a pull out of the cloud, or a push to it; nothing in the public cloud is actually reaching into your on-premises storage.
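As a rough illustration of the crawl described above, here's a minimal sketch against a Swift container database. The `object` table and its `name`, `etag`, and `deleted` columns are real Swift schema, but the upload/delete helpers, the DB path, and how the high-water mark gets persisted are hypothetical stand-ins.

```python
# Sketch of the Cloud Sync crawl over a Swift container DB (SQLite).
# Rows are appended in order, and each object name appears only once,
# so remembering the last ROWID we processed is enough to pick up
# exactly the new changes on the next pass.
import sqlite3

def crawl(db_path, last_rowid, upload, delete):
    """upload/delete are hypothetical callables that talk to S3/GCS."""
    db = sqlite3.connect(db_path)
    rows = db.execute(
        "SELECT ROWID, name, deleted, etag FROM object "
        "WHERE ROWID > ? ORDER BY ROWID", (last_rowid,))
    for rowid, name, deleted, etag in rows:
        if deleted:
            delete(name)          # propagate the tombstone
        else:
            upload(name, etag)    # copy the new/updated object out
        last_rowid = rowid        # persist this in real life
    return last_rowid
```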
So to walk through this example: we have a PUT operation, and we're going to be uploading image1, because I'm not super creative with names. This image1 is placed in S3, and we've configured our S3 bucket to issue notifications into the Simple Queue Service, that's the SQS component, and the notification will include: OK, there was a PUT, and here's the object name. At this point, our Cloud Sync daemon can interrogate the Simple Queue Service and ask it, hey, what do you have currently? What are the operations that have been done so far on this bucket? Once it gets back the list of operations, it can go on and perform all of the updates. I wanted to highlight also how this works in the same model, but with Google. It turns out it's mostly the same, but the icons are different. So Google's thing is called Google Cloud Storage, not surprising. The other icon is their Pub/Sub service, and it turns out it works pretty much exactly the same as the Amazon Simple Queue Service, maybe not super surprising. At which point our Cloud Sync daemon can talk to the Pub/Sub service, ask the same question, get back the same answer, and operate on the returned result.
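Here's a hedged sketch of that notification loop on the Amazon side, using boto3 against SQS. The queue URL is hypothetical, the `apply_update` callable is a stand-in for the code that writes into Swift, and the event parsing follows the documented S3 event notification format.

```python
# Sketch of the Cloud Sync daemon draining S3 event notifications
# from SQS. Note that Swift never has to be reachable from the
# internet: this is a pull against public AWS endpoints.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/cloud-sync"  # hypothetical

def drain(apply_update):
    """apply_update is a hypothetical callable that updates Swift."""
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=10,
                                   WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            body = json.loads(msg["Body"])
            # Test events carry no Records, hence the .get() default.
            for record in body.get("Records", []):
                key = record["s3"]["object"]["key"]
                event = record["eventName"]   # e.g. ObjectCreated:Put
                version = record["s3"]["object"].get("versionId")
                apply_update(event, key, version)
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])
```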
So then, OK, how do we deal with the case where the cluster is not just a single node on-premises? There are multiple container servers running in the system; how do we deal with having multiple instances of those that are all being updated at the same time? Right, we need to tackle the problem of overwrites and eventual consistency, and how do we make sure that we don't have stale data propagated everywhere, and that you actually get the results you expect? So I'm going to start by talking about eventual consistency just from the side of propagating data from Swift into a public cloud. In this case it's S3, but it could be something else. So imagine we're going to put version two of an object into our container, and we're going to write that correct version into S3, and everything's awesome. But at some point, another instance of this Cloud Sync daemon is running on another container node, and these two container nodes do not have to be in sync with each other at all times. That's one of the nice properties of Swift: you have this highly available, awesome system, which may occasionally give you stale results. We've engineered our Cloud Sync daemons to not be cognizant of each other's existence, so they don't introduce any single point of coordination. They don't talk to each other to agree, OK, we need to make sure that this is the right version. So at this point, you might actually get a stale object placed into S3, as the prior version of this thing might show up on another container node and be propagated into the bucket, and that daemon thinks it's doing the right thing, and it's pretty great. It seems like a bad situation to be in, but fortunately Swift helps us. Eventually Swift will communicate between these nodes and propagate the update of the container database from one node to another, at which point the new version of the object will make it into the updated container database. And I'd like to circle back to the discussion of container databases, where each object only appears once. So when this update happens, there's a new row appended, and that row will have the name of this object, the correct ETag, and the modified date, at which point the Cloud Sync daemon will say, oh hey, this is a new row, I have not processed this yet. Let me go copy this over to S3, rectifying the staleness of the data in the public cloud. The other attribute of this design, like I was alluding to earlier, is that you have the whole cluster participating in the synchronization, and typically, in the deployments that we have, that means lots and lots of nodes all pushing data at the same time, so you can reduce the time it takes to get the data you want synchronized up into the public cloud. So much so that we had to start thinking about how to put limiters in the system, so we don't use so much bandwidth. Yeah, it turns out people don't want you clogging their internet connection with your updates to S3 all the time. So the other scenario I want to talk about is propagating data from S3 into Swift. We can run into similar issues here. Luckily, we can rely on one of the cool features in Swift, the X-Timestamp header, which is a way for us to propagate an object's date back into Swift when we're copying the object out of S3. So for example, if in Amazon we have this object and it's got a timestamp of January 1st of this year, we'll find the object, we'll copy it over, we'll use the X-Timestamp header, and we'll set its date, even though the copy might be happening much later, say, yesterday. If at some point you're placing another object into the system, Swift will also assign a timestamp, and that will be the timestamp from when the operation actually completed. This way we can ensure that these stale objects from S3, which might otherwise come into Swift with a new timestamp, will not actually cause you to have stale data in your Swift cluster. So as the updates are propagated, Swift will update the container and the object node with the new object that you've just placed, and you're saved from having these stale objects from S3 living on forever, pretending to be the latest versions of everything. And the last one is dealing with overwrites in the public cloud itself. So of course you could go into your S3 bucket, take your object, and overwrite it a bunch of times, and somehow we're going to have to figure out what's the latest version. And it turns out when you go to S3 after you've done the overwrites a few times, you're not always guaranteed to get the last version of this object. It turns out we can leverage bucket versioning, which is another feature that Amazon allows you to enable. What it gives us is this: if we had image1 in S3 and we overwrote it, the SQS notification will include the version that is actually in the bucket. So now when we're querying SQS, asking it, hey, what are the updates that I've been missing, it will tell you there's a PUT, and also the version of this object. Now when we're interacting with S3 and we ask it for, OK, I want image1, but this specific version, you may get a 404 back, because that's how eventual consistency manifests itself at this point. But you're not going to get back a prior version of the object; that's the guarantee that you get from Amazon in this case. So you can retry and eventually get the correct version, hopefully.
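As a sketch of those two mechanisms together, here's how pulling a specific object version out of S3 and stamping it back into Swift might look. The endpoint and credential handling are hypothetical, and note that Swift only honors a client-supplied X-Timestamp for suitably privileged or internal requests, which is the position Cloud Sync is in.

```python
# Sketch: fetch one specific version from S3 (retrying the 404s that
# eventual consistency can produce), then PUT it into Swift with an
# X-Timestamp derived from the S3 Last-Modified date, so an older
# copy can never shadow a newer one in Swift.
import time
import boto3
import requests
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def copy_version(bucket, key, version_id, swift_url, token, retries=5):
    for _ in range(retries):
        try:
            obj = s3.get_object(Bucket=bucket, Key=key,
                                VersionId=version_id)
            break
        except ClientError as err:
            if err.response["Error"]["Code"] not in ("NoSuchKey",
                                                     "NoSuchVersion"):
                raise
            time.sleep(1)   # version not visible yet; retry
    else:
        raise RuntimeError("version never became visible")
    # Swift's X-Timestamp is a Unix epoch; use the object's real mtime.
    ts = "%.5f" % obj["LastModified"].timestamp()
    requests.put("%s/%s" % (swift_url, key), data=obj["Body"].read(),
                 headers={"X-Auth-Token": token,
                          "X-Timestamp": ts}).raise_for_status()
```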
It turns out overwrites with Google are super easy, because they're strongly consistent on overwrites. So that was very cool to engineer. And the last bit I wanted to talk about, which is maybe the least fleshed out, but also something we're tackling as we move along: Joe alluded to the three use cases that we have, archiving, cloud bursting (or elastic computing in the cloud), and collaboration. When we're thinking about these workloads, we're also thinking about how to express notions of policies, of rules, for our containers and the objects in them. And how do we map these policies, which could be numerous, to these use cases? Luckily, it turns out that for the three use cases we have, there's only a small set of policies we need to express, as sketched below. For cloud bursting and collaboration, we're going to have to propagate deletes from Swift into the remote cloud, or the other way around. For example, imagine a case where I want to share an object with Joe. It's pretty great, and then I realize that was a bad idea and I need to go get rid of it. Not propagating that delete into the bucket would allow Joe, and whoever he gave temp URL access to, to continue to download it forever. For cloud bursting, it's a similar thing. In both cases, you may want to set an expiration time when you're propagating these objects to other clouds; that's why we need the expire-objects-at-destination policy. And archiving is in some ways one of the easier policies, where we only need to expire objects locally, and at that point we consider objects to live on forever in the cloud.
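None of this is a public configuration format; just to make the mapping concrete, here's a hypothetical sketch of how those three profiles might express the policies just listed. Every key name here is illustrative, not a real SwiftStack schema.

```python
# Hypothetical policy profiles for the three use cases described
# above; expiration values are in seconds and purely illustrative.
PROFILES = {
    "archive": {
        "propagate_deletes": False,          # cloud copy lives on forever
        "expire_local_after": 30 * 86400,    # age data off premises
    },
    "cloud_bursting": {
        "propagate_deletes": True,
        "expire_at_destination_after": 7 * 86400,
    },
    "collaboration": {
        "propagate_deletes": True,           # un-sharing must really delete
        "expire_at_destination_after": None, # keep until revoked
    },
}
```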
And then I think just the last few slides here, drilling into a few of the use cases. So, second site: this is having a public cloud archive for on-premises data, and it enables the ability to just set up the profile, set it on a bucket or a container, and then synchronize that data to something like Google Nearline or Coldline, or into AWS S3, or have it tier off into a Glacier product. And then the system, behind the scenes, takes whatever data gets placed into that bucket and synchronizes it over. And it's nice because it doesn't require any additional software up in the public cloud to peer into it; it's just there in the native format. The other thing that we're seeing is customers that have a compute-intensive job on-premises that they need to burst out. So we've been working with, for example, the Cisco CTO group, which has put together a Dockerized workflow with SwiftStack. And when people are creating lots of Docker containers to run a job, oftentimes it's really hard to manage a mount point that needs to be distributed across lots of those containers. So it's pretty natural to use an object API to get data and put data into the system for persistence. And that means that they can use available resources on-premises to run that job, and if they need more capacity, they can run that same workflow in the public cloud, yet still see the data that they might need for processing. And then on collaboration with data sets, we have a nice user interface for folks to have access to the data, so they can see what data's in the system and manage it. That lets them set ACLs so they can do some of the collaboration in their local environment. Or, if they're synchronizing that data to the public cloud, they can have access to it there. And across industries we're seeing this where you have these large data sets, things like media files that might be used for data processing or collaboration, or life sciences data, which are large data sets; we're seeing them synchronize that data up for those use cases. And then just finally, we'll be happy to take any questions, but we encourage you to try it out. Go to swiftstack.com/testdrive and you can play with it yourself; it's all self-service in terms of getting up and running. But thank you. Happy to take any questions. So if you have a big multi-tenant, self-service OpenStack cloud, how do you deal with credentials and things to S3 or something? So I would say most of our customers will be using things like Active Directory or LDAP, typically, to deploy their environment. We do have a Keystone integration for access locally. But in terms of needing to bring credentials in, they'll go to something like IAM, which they'll use to create the credentials in the public cloud, and then bring them into the management tool and put that information in there. Anything else to add? Well, I think there's one other thing that I think you were trying to get at, which is how do we propagate the credentials from a public cloud, right? Like S3 bucket access, or a Google Cloud one, whatever it is, and how do you resolve the mismatch, right? You may have access to Swift, but it's not clear that you have access to either one of these. So right now we allow administrators to create a kind of storage profile, which is essentially: here are my Amazon credentials. And right now it's administrator-driven, so they would create the mapping: this account can sync to this bucket, for example. We're envisioning that at some point we could also allow consumers, the users of this cluster, to say, hey, I have a token for these credentials, and I'd like to sync this container using this token, so we can at least map their request to the actual S3 credentials and validate that this is proper and they're allowed to do that. Yeah, that's a good point, but today what we do is we have the operator provision the credentialing information and then give that a profile that the end users can use. That's how we do it. For this hybrid scenario, when you sync data back and forth between on-prem and public cloud, do you see customers kind of trying not to go there because of the egress costs of taking the data out of the public cloud to on-prem? So the question is, do you see people not wanting to egress data? Yeah, because of the cost of egress. Well, Direct Connects have also helped customers with that, particularly on the egress cost front. But you start seeing customers that have very large storage footprints in the public cloud, in S3, and it starts making financial sense to bring on-premises a portion of that which might not be in active use by data processing or being served actively. So we do see that as a use case. So they actually do archive on-prem? Yeah, for sure. Interesting, thanks. Yeah, that's true. I think one of the other things to consider is that for certain workloads, you'll see the result of the computation be much smaller than the input size, in which case your egress cost is going to matter much more than your ingress, which Amazon doesn't charge you for. Thanks. So, just a curiosity about the use case of a registry.
So perhaps we could make the registry push to the local endpoint, and expect that to be pushed to the public cloud, and then have something like a function, a Lambda, pull out the container to deploy in the public cloud. Kind of exactly like what Timur described? That's a good question. We've actually talked about Lambda, and I think it'd be pretty cool to do some of these things where we don't have to run a piece of software in Amazon or in Google itself. I've only thought about it as far as thinking about it, not actually doing it, but I think that's something to explore. It would be pretty cool. Right, so basically, on the first slide that Timur gave: if you're in the public cloud and you want to have access to your on-premises data, and you want that to route automatically, you need to hit a dedicated endpoint. So can we take that same code and implement it as a Lambda function in an AWS environment, so it just runs on a request? Yeah. Next question. If my on-prem applications are legacy applications which do not understand an object protocol, they still speak file-level protocols, then how do I use this thing where I sync data with the public cloud, where in the public cloud my application understands the data in object format, and on-prem it understands the same data in file format? Yeah, that's a really good question and I'd love to talk about that, but I want to talk about that later. All right, thanks for sticking with us. Oh, one more, thanks. Sorry. So how do you deal with encryption keys? Are you doing server-side or client-side or both, and how are you handling your encryption keys in the different environments and locally? Right, so Swift supports encryption at rest when you're storing data in your Swift cluster. For the purposes of actually making this useful, when we transfer data from your Swift cluster into, let's say, S3, having that data encrypted with the Swift keys is not going to be useful for any applications, aside from maybe an archival or DR kind of situation. So we will undo that encryption at that point, and then you can configure encryption keys on your S3 bucket and use the Amazon-style policies on that. At least that's as far as we've thought about it. One more. Hopefully this is the last one. On one of the slides you showed, two sides are trying to push the same object, in the context of eventual consistency, right? But won't that unnecessarily increase the cost for the customer? Yeah, it's a good question, right? So that's a corner case that we have to deal with, and we're demonstrating that we can do the resolution in an uncoordinated way when those corner cases happen, not that we expect that to be the standard case. And if we didn't show that, then somebody else would have come up and said, hey, how do you deal with this situation? So that's a good question, though. But yeah, we do partition the workloads, so when you do have multiple Cloud Sync daemons running at the same time, they're not actually all checking the same parts of the database at the same time. We do some modulo arithmetic to essentially split up the total number of rows into a workload per daemon. Since they don't talk to each other, you may have a situation where two of them upload the same object, because, let's say, the object is large: one of them starts uploading, the other one checks if the object exists, it doesn't yet, and it'll try to upload it again. But that's more of an edge case versus how the system is designed to work in the common case.
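As a minimal sketch of that modulo split (the function and names here are hypothetical, not the actual Cloud Sync code):

```python
# Sketch of how uncoordinated daemons can split up a container DB
# without talking to each other: each daemon only processes rows
# whose ROWID falls in its modulo class, so the rows partition
# cleanly across the cluster with no shared state.
def my_rows(rows, daemon_index, daemon_count):
    """rows is an iterable of (rowid, entry) tuples from the container DB."""
    for rowid, entry in rows:
        if rowid % daemon_count == daemon_index:
            yield entry
```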
All right, thanks for joining us this evening. Thanks so much. Thank you.