Thank you and welcome. So we're here today to talk about OpenStack Swift, from one to many. And it's really about growing your OpenStack Swift deployment based on your customer needs. So you've built the Swift cluster. The customers have said, we love it, we have lots more data to store. Not only are you growing in a single region, but we have demands that need you to spread to multiple regions across the globe. We're going to talk about some of those considerations, some of the key things that we went through in our thought process as we started to undertake this exercise.

My name is Paul Burke. This is Jonathan Brown with me. We're from Symantec, and within Symantec, we're from the Cloud Platform Engineering team. We are building out a private cloud, at public cloud scale, to service our Symantec products, which will then service our consumers.

So today, what we're really going to talk about is: what is Swift, and how we're using Swift at Symantec — a quick overview — and then get into growing your Swift cluster. And we can't have a Swift talk without talking about the ring, so we'll give a quick overview of that. We'll go through storage policies and multiple regions. We also wanted to touch upon, now that we've grown this cluster, how do we evaluate cluster performance at scale? What kind of tools are we using, what's the thought process we're using in looking at that and tuning that? Also, how's your cluster running? Once you start extending to multiple clusters, how do you collect the data and aggregate it up so you can get a unified view of your cluster and cluster health? And then we'll quickly touch upon the thought process around securing your cluster.

So really, a quick overview of what Swift is. It's part of the OpenStack platform. It's a massively scalable, reliable object storage platform. And the great thing about this is that it's a distributed system — there are no single points of failure. You can lose a hard disk. You can lose a node. You can lose an availability zone. If you have multiple regions and your region weights are set properly, you can actually lose a region. And still, you won't have data unavailability or data loss. That's really important in a massive-scale distributed system. Also, storage capacity grows without bound. As customers come in with more and more data needs, we want to be able to quickly add to this cluster, rebalance the cluster, and grow to meet the demands of our customers and our business. Multitenant, massive concurrency, REST API — again, all tenets of building a cloud-enabled distributed application. And as a quick note, what Swift is not for: it's not for file hierarchies or live databases. It's really for unstructured content.

So within Symantec, what are we using Swift for? Symantec is the largest IT security company in the world. And in security, it's really all about the data. We have massive amounts of data collection that we pull in — we collect data from all over the world, from over 175 million endpoints.
We pull in this data, we store the data, we process the data, and we put the data back on storage to then be retrieved by other applications and/or our customers. So with that said, we have this data. After we analyze the data, we also store quarantine data — we find bad files or malicious content, we make sure that data is contained, and again, that data gets put back onto our object storage system. Also for our Security Technology and Response data: after the data is analyzed and they need cold storage for it, it goes back into our Swift cluster. We also service our application teams running their infrastructure on our cloud infrastructure as a service — their VMs. They're running their own databases, NoSQL and SQL. They need to take those backups and put them somewhere, so they're backing those up into our object storage system. And then some of our customers, which are internal customers, have not only single-data-center requirements, but multi-region, active-active requirements. We're using Swift for that as well.

Servicing our customers is what we're doing, but what about our own infrastructure, our own cloud? Within our own infrastructure, the typical pattern is to use object storage for Glance images and snapshots. We're doing that, which is a standard pattern. Also, within our cloud underlay, we're actually storing our logs and metric data as well, in compressed format.

So if we take a typical single-region use case for one of our customers: we have an endpoint protection product. It's a hosted application receiving telemetry data from the public net. All this data comes in — again, we collect data from 175 million endpoints within Symantec. Telemetry data comes into the application running in our infrastructure as a service, on the customer VMs. That data gets stored down into our object storage system — a single-region use case, a three-replica policy. That data then gets pushed through a security analytics pipeline where it's analyzed, and the results are put back into the object storage system for retrieval. So with regards to Swift, it's leveraged as the central data collection point for both data storage and data delivery. As you can see, it's a really critical part of our infrastructure.

Let's talk about growing your Swift cluster. So we have this single-region cluster, we're taking in data, we're persisting data, we're getting all these different use cases on the data. Now we have to meet our customer demands as it starts to grow. It starts to grow internally within a single region, with different storage characteristics — we need different storage policies to meet the different needs of our customers. We also need to extend to different regions, first off for DR purposes. But what about other storage considerations, where I need data replicated in region A and region B?

As we talk about geographic scale, it's always good to revisit the Swift ring. I don't think you can have a Swift talk without talking about the ring, so I felt compelled. The Swift ring is a consistent hash ring, and it's really all about mapping the data locations — partitions and replicas — onto the cluster. The Swift rings are managed outside of the cluster and distributed to every node. When you put data into Swift, it's going to take that data and place it across as many failure domains, or fault domains, as possible.
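To make the ring a bit more concrete — this is an illustrative sketch, not anything from our deployment — here's roughly how Swift hashes an object path to a partition. The hash path prefix/suffix and part power below are made-up example values (real clusters keep theirs in /etc/swift/swift.conf):

```python
# Illustrative sketch: how Swift's consistent hash ring maps an object path
# to a partition. Prefix/suffix and part power here are hypothetical values.
import hashlib
import struct

HASH_PATH_PREFIX = b''            # hypothetical; set in /etc/swift/swift.conf
HASH_PATH_SUFFIX = b'changeme'    # hypothetical; set in /etc/swift/swift.conf
PART_POWER = 18                   # e.g. 2**18 partitions in the object ring

def get_partition(account, container, obj, part_power=PART_POWER):
    """Return the ring partition an object path hashes to."""
    path = ('/%s/%s/%s' % (account, container, obj)).encode('utf-8')
    digest = hashlib.md5(HASH_PATH_PREFIX + path + HASH_PATH_SUFFIX).digest()
    # Swift takes the top bits of the MD5 digest, shifted by (32 - part_power)
    return struct.unpack('>I', digest[:4])[0] >> (32 - part_power)

print(get_partition('AUTH_demo', 'telemetry', 'endpoint-123/scan.json'))
```

The partition number is then looked up in the ring's partition-to-device table, which is what spreads the copies across regions, zones, servers, and drives.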
When placing those partitions, Swift will first try to spread replicas across multiple regions, then go down into multiple zones, servers, drives, et cetera. Object, container, and account services all have their own ring.

Up until multiple storage policies arrived, you had three rings in Swift: you had a single object policy, and whatever the replica count was set to on that object policy, that was the replica count for your entire cluster. Now we have multiple storage policies. Storage policies are a very, very powerful tool, and I would encourage you to explore them if you're not already. What they do is enable support for multiple object rings. Once we have multiple object rings, that gives you the capability of very flexible data management. With regards to that, you can set policies based on your customer needs: data placement locations, whether that's placement within a single region or placement across regions. Data durability — so you can say, okay, how many replicas do I have, or do I have one erasure-coded copy? Data availability, meeting your access patterns. For instance, maybe I have an application that requires fast access, and I create a storage policy that places the data on SSDs. Or maybe I have an archival application that's write once, read maybe never. That may be an erasure-coded policy, something I can use to store that data and get the savings on overhead. And we're very excited — erasure coding was just announced, so we're very much looking forward to experimenting with that.

Once you apply a storage policy to a cluster, it cannot be deleted. And it really can't be deleted because deleting a policy could leave data orphaned on the cluster. So instead, you deprecate the policy. What that means is objects can still be stored within containers that already have that policy, but new containers cannot be created with it. This is a very good tool, especially when you start to extend your cluster and extend your object rings to where your partitions start to thin out. Ideally, you want about 100 partitions per disk. So once you start to extend a policy and your partitions thin out to fewer than 100 per disk, you might want to consider deprecating that policy and moving to another policy to reset the partitions-per-disk count.

Policies are applied to containers via metadata. The key thing here is that once they're applied, they cannot be changed. This is very important, because you may have an application data life cycle where you have, let's say, some sales bulletins, sales materials, with five replicas because they need to be highly available. And then all of a sudden that data doesn't need to be available like that anymore. Maybe you want to transition it to one erasure-coded copy, because it might be important someday, but you don't want to lose it. In order to do that, you need to take the data and copy it out of the container that it's in into a new container with that erasure-coded policy. The storage policies are transparent to the objects within the container. And as we said, if you want objects to migrate between policies, you copy them out of one container with a certain policy into a new container.
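As a hedged illustration of the container-and-copy mechanics just described — the policy names, endpoint, and credentials below are hypothetical, and this assumes python-swiftclient:

```python
# Sketch: apply a storage policy to a new container, then "migrate" an object
# by copying it into a container that uses a different policy.
from swiftclient import client as swift

conn = swift.Connection(
    authurl='https://keystone.example.com:5000/v2.0',   # hypothetical endpoint
    user='tenant:user', key='secret', auth_version='2')  # hypothetical creds

# Policies are applied via the X-Storage-Policy header at container creation;
# once set, the container's policy cannot be changed.
conn.put_container('bulletins-5rep', headers={'X-Storage-Policy': 'five-replica'})
conn.put_container('bulletins-archive', headers={'X-Storage-Policy': 'ec-archive'})

conn.put_object('bulletins-5rep', 'q3-sales.pdf', contents=b'...')

# To move data to the erasure-coded policy, copy the object into the new
# container -- it takes on the target container's policy -- then delete the old one.
hdrs, body = conn.get_object('bulletins-5rep', 'q3-sales.pdf')
conn.put_object('bulletins-archive', 'q3-sales.pdf', contents=body)
conn.delete_object('bulletins-5rep', 'q3-sales.pdf')
```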
So we've talked about multiple storage policies — a really key component for us in extending to multiple regions. Typically, regions are geographically distributed. Disaster recovery is the common use case, along with active-active storage sharing between regions. Now, I say typically they're geographically distributed because it's very possible you could have metro-distance data centers where you want to create two regions just to make sure that when data is stored, it uses different fault domains. Or, for example, let's say data created in Poland can't leave Poland. You'd want a DR site, so you would have two regions in Poland to make sure that when data is stored, it gets replicated across those two regions as the first level of isolated fault domains.

So when you start talking about multiple regions, what are the key considerations? What's really interesting here — and I put this first — is the default policy. Although it sounds very simple, it becomes really interesting when you have competing use cases for the storage. What really makes sense to be that default policy? You have people spinning up applications in their VMs in the cloud, and maybe they're not setting the policy, and maybe they're starting to write lots of data. What happens to your system? If you have a default policy that says two replicas in each data center, now you're sending all that data over the WAN. It depends on your use cases. This is something we talked about a lot, and I felt it was worth raising.

We also talked about replication when extending to multiple regions: the location of replicas, set with our storage policy, and the number of replicas. You also want to think about a separate replication network when you start splitting your cluster, especially across multiple regions, and you want to enable QoS. You want to make sure that if you have to rebalance your cluster, you know how much data you're going to push across the wire and that you can handle that throughput.

Another consideration when you're going to multiple regions is region weighting. You want to make sure that your regions are weighted properly. Let's say one region is a much bigger cluster than your second region — you want to make sure it's weighted properly. Maybe it's weighted so that two replicas always go to one region and one replica always goes to the other. You can balance that out. It's really good to take into consideration, as you're spinning up additional regions, not only how your data is going to exist, but how it's going to be placed across those regions.

Read and write affinity is something that becomes important when we extend to multiple regions, especially in the context of application performance. Read affinity is equally important, but write affinity is something we really paid a lot of attention to. And then, we talked a little bit about the replication network, but also bandwidth sizing: make sure you have the right amount of bandwidth to store that data and move it across to the different data centers.
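As a sketch of the region-weighting idea above — not our production tooling, and the part power, IPs, device names, and weights are all made-up example values — a two-region object ring might be built like this, with the larger region carrying roughly two replicas' worth of weight:

```python
# Hypothetical sketch: building a two-region object ring with swift-ring-builder,
# driven from Python. All values below are illustrative only.
import subprocess

def rb(*args):
    subprocess.run(['swift-ring-builder', 'object-1.builder'] + list(args), check=True)

# 2**16 partitions, 3 replicas, 1 hour minimum between partition moves
rb('create', '16', '3', '1')

# Region 1 (larger site): higher weights, so it tends to hold two replicas' worth
for i, ip in enumerate(['10.1.0.10', '10.1.0.11', '10.1.0.12'], start=1):
    rb('add', 'r1z%d-%s:6000/sdb1' % (i, ip), '200')

# Region 2 (smaller DR site): lower weights, roughly one replica's worth
for i, ip in enumerate(['10.2.0.10', '10.2.0.11', '10.2.0.12'], start=1):
    rb('add', 'r2z%d-%s:6000/sdb1' % (i, ip), '100')

rb('rebalance')   # writes object-1.ring.gz, which gets pushed to every node
```

The resulting ring file would then be distributed to every node, like any other Swift ring.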
So, the multi-region use case. Write affinity is a really interesting concept, something we use for our applications. Again, we have applications that receive and scan 30% of the world's enterprise email traffic each day — about 1.8 billion requests per day. Lots of data. So this data comes in, it gets scanned, it gets processed. What does it look like? These applications are running, and in this case it's a multi-region use case, so in this context we have an application running in the cloud. The request comes in, goes to Keystone, gets a Swift endpoint, goes to the proxy, which validates the token. The proxy will write all four copies of the data locally — in this context we have a four-replica policy, two in each data center, to meet our customer demands for this application. This happens synchronously; the return goes back to the application saying success, and the application moves on. Behind the scenes, two of those replica copies asynchronously get moved across to the other region. So it's important to enable this when you start to use multiple regions, if speed of request is something you're focused on. Or maybe it's durability — maybe you're much more focused on durability than performance, and you don't mind the overhead of the application writing across the WAN for every request to ensure the data exists in the other data center, so that if there were a network partition, you can guarantee the data is on the other side. It depends on your use case. Again, these are very good tips and tools you can use, but it's really about understanding your use case, the storage profile, and how your applications need to use this data — that's what sets you up to define how your multiple regions work. And with that, I'm going to hand it over to John to talk about performance.

Thank you. So first, evaluating the performance of our Swift cluster. There are several performance testing tools out there that we looked at and used: COSBench, ssbench, swift-bench. The first thing we did was a very simple workload — for example, doing only 16-kilobyte object writes at 120 concurrent clients. That tells us a little bit; it doesn't give us the full story. But it's still very useful for getting a baseline understanding between configurations, or especially for changes outside of the Swift cluster. Let's say you're changing policies in your load balancer — you can understand the impact on the cluster and get a very quick sense of what that impact could be. The next thing we would do is understand our particular applications' workload patterns — how much they're reading, what size objects, all of that — then model it and replicate that load on the cluster, then add the rest of the applications and fully simulate the load we'd see in production. Just as important is to then inject faults — failed disks, or a partition between the data centers — and ensure we can still meet our SLAs. And in some cases, for smaller, quick tests, it's useful to completely clear out the cluster — all the data — and clear out the inode cache and so on, to get easily reproducible numbers for certain changes.

Some other considerations about test benches: consider where you're going to run them. We're choosing to run ours on our IaaS, where our applications are running, so we can get more realistic numbers in response times and make sure we're not hitting real bottlenecks in our network. Being able to spin up multiple test benches through automation, and to simulate multiple applications, gives an understanding of the impact each application would see. It's also important to store all the results, the configurations, and everything else, so you can fully audit — if an anomaly happens, you can understand exactly why it happened.
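A minimal sketch of that kind of baseline workload — 16 KB PUTs from many concurrent clients — assuming python-swiftclient and made-up credentials; a real test bench would obviously do much more (mixed operations, fault injection, result storage):

```python
# Sketch: 16 KB object writes from 120 concurrent clients, recording latency.
import os
import time
from concurrent.futures import ThreadPoolExecutor
from swiftclient import client as swift

AUTH_URL = 'https://keystone.example.com:5000/v2.0'    # hypothetical endpoint
USER, KEY = 'tenant:bench', 'secret'                    # hypothetical credentials
CONCURRENCY, REQUESTS, OBJ_SIZE = 120, 6000, 16 * 1024

payload = os.urandom(OBJ_SIZE)
swift.Connection(authurl=AUTH_URL, user=USER, key=KEY,
                 auth_version='2').put_container('bench')

def worker(worker_id):
    # one connection (and one auth) per simulated client
    conn = swift.Connection(authurl=AUTH_URL, user=USER, key=KEY, auth_version='2')
    latencies = []
    for i in range(worker_id, REQUESTS, CONCURRENCY):
        start = time.time()
        conn.put_object('bench', 'obj-%06d' % i, contents=payload)
        latencies.append(time.time() - start)
    return latencies

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = sorted(t for ts in pool.map(worker, range(CONCURRENCY)) for t in ts)

print('median %.3fs  p99 %.3fs' % (results[len(results) // 2],
                                   results[int(len(results) * 0.99)]))
```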
And we make sure that every time we test, we test as close to production as possible. We use our production automation, and we go against our production Keystone — we wrote a small service to do endpoint substitution in Keystone so we could use the same service — and we enable the same logging and metering.

We also put a lot of work into visualizing all our metrics, being able to understand the characteristics of an individual system, and helping understand what we would want to give to our customers as an SLA. That was pretty useful for understanding misbehaving hardware or individual disks at a certain level. We also use heat maps a fair amount to understand exactly where the bottlenecks are. It's really easy to tell, and easy to point out and ask questions about where a particular change happened — you can really see an inflection point, for example, right here towards the right-hand side of this chart.

Let's spend a minute on visibility and operations — understanding exactly how things are running. First, we need to have all the data to understand how things are operating. We need the logs, we need system metrics, including metrics about XFS and how that's performing, Swift Recon, and our internal health checks to make sure everything's running properly — plus certain tenancy requirements, depending on whether you need to know SLAs or things about individual tenants. After you have all this data stored, we need to be able to consume it. We're aggregating at a higher level to understand the whole cluster, and the whole picture across both data centers, and to show what we consider acceptable values at a high level. But we also need the ability to dig in when something doesn't seem quite right — go to a deeper level — and alerting if, say, there's a disk failure. It's also useful to add the dispersion report. We also leverage synthetic transactions to model our customer use cases, to ensure that we are consistently fulfilling their needs. This is a slightly lower level of the kind of data we're able to provide for understanding. And for security — Paul.

Sure. One thing to mention about synthetic transactions — what does that mean? I think it's really important, because this is something we've done across our cloud. As new customers come on, we understand their use cases and model those use cases in code that we can automatically run on a routine schedule within our cluster. All that data is evented — all the stats are emitted — and put into our logging system, which we can report back on, so we actually understand if our customers are impacted before they do. Hopefully before they do; that's our intent. And what we understand with Swift in this context is that there are only so many sets of usage patterns. Now there are multiple different storage policies, and you have different synthetic transactions to model the different consumption methods across those policies. But when you start to model these synthetic transactions, they become really important in understanding your customer patterns, how they're using the cluster, and then automating those. So we actually evaluate cluster health — something we unfortunately didn't show in this presentation — our cluster health and our availability from our customer's perspective.
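Here's a hedged sketch of what one of those synthetic transactions might look like — a canary PUT/GET/DELETE cycle that times each step and emits the numbers to a metrics pipeline. The statsd endpoint, container, credentials, and object size are all hypothetical:

```python
# Sketch of a synthetic transaction: model one customer pattern, time each step,
# and emit timings so the aggregated view can alert against SLAs.
import os
import socket
import time
from swiftclient import client as swift

STATSD = ('metrics.example.com', 8125)       # hypothetical statsd endpoint
_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def emit(metric, seconds):
    """Fire-and-forget statsd timer."""
    msg = 'swift.synthetic.%s:%d|ms' % (metric, int(seconds * 1000))
    _sock.sendto(msg.encode(), STATSD)

def timed(metric, fn, *args, **kwargs):
    start = time.time()
    result = fn(*args, **kwargs)
    emit(metric, time.time() - start)
    return result

conn = swift.Connection(authurl='https://keystone.example.com:5000/v2.0',
                        user='tenant:synthetic', key='secret', auth_version='2')
conn.put_container('synthetic-canary')

payload = os.urandom(16 * 1024)              # model a 16 KB telemetry object
name = 'canary-%d' % int(time.time())

timed('put', conn.put_object, 'synthetic-canary', name, contents=payload)
timed('get', conn.get_object, 'synthetic-canary', name)
timed('delete', conn.delete_object, 'synthetic-canary', name)
```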
Based on how these synthetic transactions run, we can understand how our customers are affected by anomalies, against the defined SLAs that we've provided to them.

So let's talk about security in the cluster. It's always interesting, because from a security perspective there's no formulaic approach that says: if you do these three things, your cluster will be secure and you will be very happy. Security is more a set of questions and a set of methodologies that you go through to understand it. So where do you start? How do you know if your cluster is secure? The approach we took is threat modeling. We built out a data flow diagram — a data flow of Swift, but not just Swift itself, Swift inside your environment. When you start to model the data flow, you start to identify which threats you have. If I look at this model from an engineering perspective, which threats do I see? How could I get into the system? You identify the vulnerabilities you see in the system through this process. Also, start to think about penetration testing. There are plenty of tools out there that you can use to penetration-test different environments. It's good to identify these exploits, and once you start to identify them, you can prioritize them and decide which ones make sense for you to address or not, in the context of the time by which you have to address them.

Some of the considerations we thought worth mentioning in this process: be sensitive with your credentials. For instance, if you're using deployment automation to deploy Swift, you might want to think about encrypting the Keystone credentials that are stored in the proxy config within your deployment automation. If you do that, then once you deploy Swift, the proxy is locked down so that only authorized people can go to it and actually look at that config. That's one of the things to start thinking about. Also, limit direct access to your storage nodes. If you're storing customer data, obviously that data is very sensitive, and you don't want the wrong people to get it. So you might want to think about limiting direct access: which machines, which nodes, which people can have access to those storage nodes?

Another thing that we talked about a lot is where to terminate SSL. Depending on your customer use case, some customers might say, I need end-to-end — not just end-to-end to the proxies, but end-to-end throughout the entire system: client to proxy, proxy to storage, and then over the WAN. So we need to start thinking about this. The typical pattern is terminating SSL at the load balancer. In some use cases, that's exactly what we do in our environment, but the conversations do come up: is that good enough? These are some of the things that, based on your use case, you could start to think about.

So, a really interesting thing we've been experimenting with lately. I put this into the context of security because of the network isolation and multi-tenant networking of SDN. What does SDN give you? It provides easy network isolation for multi-tenant networking. In our environment, all of our VMs running in our infrastructure-as-a-service platform are running in the overlay.
So Swift running in the overlay gets that nice network isolation that we can leverage, but it also avoids the overhead of bouncing from overlay to underlay and back again for every client request, so you get rid of that performance bottleneck. How do we go about thinking about this? A couple of different ways. One is fronting your proxy servers with load balancer as a service: the front end of the load balancer is in the overlay, the back end is in the underlay, and Swift is running in the underlay network. All the client connections come in from the overlay network. One interesting thing about this is that you're running the cluster in the underlay, which is what we all know how to do. You also have full control of the load balancer — with load balancer as a service, you can scale it on an as-needed basis, and you can control it without having to go through IT tickets to get a hardware load balancer updated. And if you're running virtual proxies, you can start to build automatic, dynamic scaling of proxies — elastic scaling and shrinking. Another interesting option is Swift running entirely in the overlay: you get ease of management, and also the security you get from running in isolated multi-tenant networking. This is something we're actually experimenting with. We have a 760-terabyte cluster running in the overlay right now, and we're still working on it — tuning it, understanding how it works, how it behaves, how it performs. In that context, if you have experimented with this, or if you're interested in learning more about it, please talk to us — we'd love to hear about your experiences as well. And with that, we'll ask for any questions.

You mentioned erasure coding and the Kilo release of Swift. How do you see yourselves using erasure coding?

For us, it's in an archival sense. There's some data that comes out of our analytics pipeline that we'll want to archive. What we'll do is store that data in an erasure-coded policy instead of having multiple replicas — we'll have one erasure-coded copy. And that's write once, read maybe never, in that context. So we get the value of erasure coding and the savings on overhead, and we know it doesn't need to be highly performant for that use case. For instance, we're not having lots of reads come in on that data. If we were, erasure coding wouldn't really be an appropriate solution, because it takes time to assemble all the fragments to give the data back to the client.

One thing I'm a little unclear on — our use case is medical data. Medical data is very highly regulated as far as how many copies you have to keep, how long you have to keep it, where you can keep it. So say we wanted to set up three data centers across the United States, two in Mexico, two in Canada, three across Europe, and we assigned a zone to each one of those, and say the ones in the United States can only replicate to each other legally, the ones in Europe can only replicate to each other legally, and the same for Mexico, Canada, et cetera. Is there a way that we could, through storage policies, say these regions can replicate to each other but not outside of those?

Absolutely. Okay. Absolutely, because the beautiful thing about storage policies is that one of the things they control is location — location, durability, and availability.
For your locations, you set up the object rings for those policies to work within just those data centers. That's exactly what you want. And actually, if you're interested in talking after this — I've actually done some HIPAA compliance work on another object storage system. I'd be interested to chat about a couple of things. Oh, sure.

Great talk. Thanks for going over all that stuff. I had a number of things I could ask about, but one thing stood out: you guys talked about synthetic transactions. One of the mechanisms for monitoring your cluster is producing some sort of known, quantifiable load that you can monitor. I think Donna talked about this earlier when she was talking about HP's cloud — I think the dispersion report is kind of one of those. But you guys talked about doing something a little more sophisticated, where you have some insight into the type of workload and you're simulating something more specific. SwiftStack's ssbench sort of has those scenario files. And I wonder how much you've actually been able to do, and can share, about how far you get into "this is a kind of synthetic transaction that describes this model of workload, and we're using that to backfill load against the system." I mean, this sounds awesome. Are you processing logs? How are you achieving some of that?

So, John, you want to talk about synthetic transactions? Yeah, we simply write a script to run a sample load matching the customer use case, and then we export it out to our metrics system, and then we can monitor and alert based off that. Right now we're using just Python and interacting directly; we're not using ssbench for that particular piece at the moment. Okay.

How do you know the customer workload? That's a great question. So the question is, how do we know the customer workload? Part of our engineering process when we talk to customers is that we actually go through a learning process and understand what their platform needs are, not just storage. So we know all the different components of our platform, and then from a storage perspective, we understand not only what size files and what their throughput is, but what storage policies make sense for them based on their application needs — really, what are their burst rates and what exactly are their applications doing. So we understand what their application is doing. Once we do that, whatever that application flow is — let's say, in the context of storage, from a Swift perspective — we model it. If we have a high-throughput application that's taking a lot of little files, say 16K files, bunching them up and storing them down at a certain frequency, we'll model that behavior based on that use case pattern. We have deep insight into what our customers are doing because, remember, when we talk about customers, these are internal Symantec cloud service teams. So we have that relationship.

Is that automated? Automated in what sense? In the sense that you can actually monitor the workload and then build the modeling, or do you just... you know the programmers, you know the... Exactly. Because we're talking with internal cloud teams, when we talk about engineering requirements — what they need from our platform — this is when we tease out that process.
If there are additional lines of questioning, we reach out to the architects of that system and really understand and model the flow. Once we understand it, that goes into automation, and then we make sure we collect all the data from those synthetic transactions as they run. Once we have that data, we can analyze it on the fly, and we make sure that in our aggregated view we actually report on that data against our customer SLAs. So we know what the customer SLAs are that we've provided, based on their object sizes, their throughput, and their policy, and we make sure that we meet those. If a metric gets near a threshold, we use a RAG scale — red, amber, green. If it comes within a certain percentage of the threshold, we flip it to amber; if it gets very close or goes over, we flip it to red, which sends an alert, and then we can take action on it.

I'm from NTT. So I guess the object encryption work — server-side encryption — is being discussed in the community. How do you think about object encryption? Data encryption at rest? Server-side encryption, yes. So data encryption at rest is a really interesting topic, and I'd actually love to talk to John more about it — I think it's fantastic. Typically, the pattern right now is that on the application side, they'll encrypt the data and store the key with the data in metadata, and they'll use that pattern. But I'm not sure where we are in the broader Swift community with regards to that topic. It is something that, at least from our standpoint, is very important. It really gets to key management — that becomes the interesting question there, far less about the storage and more about the key management, and how that might integrate with Barbican or other solutions. Thank you.

All right, thank you very much. Appreciate you coming. Thank you.