So, hi, my name is Joe Arnold. I'm the CEO of SwiftStack, and today we're going to talk about Swift. Our company has a bunch of core contributors to the Swift project, and we provide solutions and software to help manage and deploy OpenStack Swift. So, that's what we do as a company. We have more training on Thursday. At 1:30 we have a workshop where we're going to get down and dirty and go through all of the configuration steps with Swift. So, this is where you show up, bring your SSH terminal, and we start typing and walking through the process of configuring and setting up Swift. Then next, we'll drive some deployment automation with the SwiftStack tool. So, that's on Thursday starting at 1:30. There's also more information on training available on our website, slash training. We have a book that we just published, which is really cool. It's available at the booth. There are some more copies, but we are almost out, so go now. Not now, I mean after the talk. But if we do run out, we'll have them at swiftstack.com/book, and I'll put a form up so we can start sending them out.
So, here's what we're going to talk about. We're going to talk about Swift use cases. We're going to talk about Swift architecture, and then we'll touch on Swift and how that relates to software-defined storage.
So, I went to the Apple Store to get a new laptop, and it actually had less storage than the previous generation. (I was hiring new people. We're hiring, by the way, which is another reason to go to the booth.) But I thought that was really interesting, because here we are, in subsequent generations of technology, and we're having less storage in our personal devices. And we're not consuming less data. More is being generated all the time. So, where is it living?
You know, duh, it's living in data centers. And so, what's happening as these applications are coming out on mobile devices is that more and more content is being generated. Users are producing more video content, user-generated content, gaming, machine-generated content, and HPC is capable of producing more data. So, more and more data is being stored. I found a report online from IDC that said 40,000 exabytes by 2020. It's a lot. I was trying to figure out how to represent that in terms of physical space, but I kind of lost it.
But when you have that much data, the demands on it are a bit different. This is unstructured data, and so there are some different properties about how to store it. And so here are some things to think about. One is durability. So, a user generates the content, or a scientist generates a genome. The expectation is that it's going to be stored very durably. Next is accessibility. Even though you took that picture seven years ago, Flickr has still got to serve it to you. That data has got to be accessible. And with archive, the demands are changing, the expectation is changing. So remember, you used to have to order check images, for example, and sometime later you might get an email, or, if you remember, they got shipped in the physical mail. But now that needs to be available just like that. And so the expectations for consumers and data consumption are changing. Cost has got to be low. You have business models, companies like Evernote, for example, where they have a cost structure based on data storage and a relatively low conversion rate on their users. So the baseline cost for the storage has got to be relatively low. And then hand in hand with low cost is manageability. We routinely have customers who only have a handful of operators to manage large amounts of storage.
So the concept of software-defined data centers, software-defined storage and networking, is real, and it's real because of the efficiencies it drives in how large-scale data centers are managed. So those are things to think about.
The next thing is, I wish there was a one-size-fits-all storage system. You have piles of data, and you have applications which are taking an increasing volume in the data center, and it would be awfully convenient if we could have one system that could take care of everything. Don't worry, we got you covered. But the problem is that there are trade-offs. There are trade-offs when we do systems design. And I think Brewer's CAP theorem sums it up pretty nicely. You can have consistency, availability, or partition tolerance. Sorry, you can't have all three. You have to pick two of them. And so if you're running a VM, if you're building a database, transactional workloads, then it's really important to have consistency. You don't want to do an update and have somebody else sneak around and do an update of that data. And so that means you have to give on partition tolerance, or give on availability, in order to create that system. And partition tolerance is the ability to have two systems that are separated from each other. So we're talking about distributed systems. And if there's a sever between those, it means that the system can still operate in that case. Swift, for example, is picking partition tolerance and availability. So if there's a sever in the cluster and a request can only go to one side or the other, it'll respond. It may be old data, but it's going to give it to you. And so Swift can operate in that context. And so Swift is choosing partition tolerance, meaning any bit of the cluster may fail or we might have splits in the cluster.
And it's going to deliver high availability at the expense of consistency. And that's just an architectural choice. Every system has got to make these choices, but the Swift architecture makes that clear. This is the path we're going down.
So with those constraints in mind, I'm just going to walk through a few use cases, because it's not necessarily evident what all of the use cases are when you have those constraints. Well, what good is it for? If I can't run a database on it, what the heck is it being used for? So I'm going to walk through a few of them right now.
So the first is storage as a service. And this is creating something that looks and feels like Amazon S3 or Rackspace Cloud Files. Heck, it is Cloud Files. And you can build chargeback, you can build a web UI, and there's a whole suite of file system gateways and backup tools that you can load onto it. And so you can build out a private cloud, where you are running it as your own little mini version of a cloud, or big, depending on the use case. Or you can do a hybrid environment, where you're bursting compute workloads or web request workloads out to the cloud, but then you can keep the data in the data center. Or you can fire up your own public cloud storage service. And so we have a few folks that are doing that. Of course, Rackspace, Internap, HP, KT, SoftLayer. I think Gartner did a report of the top 10 public cloud providers, and four of them are using Swift. And we've also been working with a company called EnterIT, which is an Italian service provider. What they're doing is building cloud storage services for their existing customer base on their existing networks. And so they can sell more services into them.
The next case is web and mobile applications. This is where Swift really can earn its keep. It's where we get dragged in a lot. And so what's happening here is, like we talked about in the beginning, applications are being built and there are lots of users.
And with lots of users comes lots of concurrency, and lots of users means lots of storage. And so Swift has the ability to scale out at the request layer, so you can handle lots of concurrent requests, and also scale out at the storage layer, again, to handle lots of data. And so think of it like this. Instead of a storage system where you have one big fat pipe and you ingest one giant file into it, think of Swift like a big bundle of straws all wrapped together, where there are lots of individual streams of I/O. That's what Swift is really good at. And so the use case where you're serving out web requests or lots of clients is a really good one.
Here are some companies who are using and/or contributing to Swift. Wikipedia is an example. What they're doing is serving all of their images, video and audio from Swift. And they're actually doing something pretty cool. There's the ability to create custom middleware. And what they do is they dynamically resize images and serve them out to users. So an editor can go in and say, I want it this size, and then inside of the Swift cluster, it resizes it, stores it back and serves it out to the users. The other cool thing is that it's HTTP. It's a web interface. And so what do you do for popular content with a web interface? Put a cache in front of it. And so that's exactly what they do. They run Varnish and they cache the popular content. And then that long tail of assets that need to be retrieved goes directly into the Swift cluster, and if something becomes hot, well, it just gets hoisted up into the cache. So that's what they're doing with it.
The next category is active archive. And so what we're seeing here is HPC and video. What they want is Box, but for terabyte data sets or very large images. And what they really want is the ability to distribute that data across multiple data centers or multiple regions.
And that way, different sites or different campuses or researchers can access that data and pull it in and out and suck it into whatever processing system they have. So that's one of the other use cases.
So what is Swift? Some of the attributes here. It's multi-tenant, so lots of users. Highly scalable, durable, stores lots of unstructured data at relatively low cost. And it's great for building applications around. So I'm going to dig through just a few of these here.
Scalable. So again, you can scale on the front axis here to handle the number of requests. You can scale at the object storage system to handle more storage. And there's no single point of failure in any of those components.
Durability. There are some pretty cool durability properties. Data is stored in triplicate. So there are three copies of data in the system. When a write comes in, there is a quorum write. So when a write is serializing down to disk, it's trying to do three at a time. And two must be successful before the client gets a 200 OK, a success, back. When a device does fail, so if there's a single disk that, say, poof, goes away, the entire cluster will participate in the recovery of that failure. And when it comes to data placement, think of more catastrophic failures in a data center, like a truck running into the data center. With Swift, you can lasso together a bunch of servers and call that a zone. You can lasso together a region. And Swift will do what we call unique-as-possible data placement, to put data in as disparate locations as possible.
Highly concurrent. So size isn't necessarily the factor here, but it's the ability to scale up the access into it. So you can handle lots and lots and lots of simultaneous reads and writes out of the system. And it's pretty good at that.
It's open source. And there's a ton of energy behind this. I mean, we just ran a full day of technical sessions diving into the guts of Swift.
We actually had to schedule a second day, where we're kind of taking over one of the other ad hoc tracks, just so we can accommodate everyone who wanted to talk about their contribution to Swift. There are over a hundred participating developers. There are many companies involved. And Swift is tested in large-scale production environments before we drop major releases into OpenStack. I mean, by the time it gets into an OpenStack release, it's already been running at large scale for months and months. And because there are so many vendors participating and running customers, there are many vendors to choose from. There's no lock-in. And I think that's pretty important when you're making a decision about what storage system you're thinking of building on.
And that dovetails into the application ecosystem. So, I mean, we have Koofer here in the audience, and they have a great tool for building file sharing on top of it. Adam Bain from Maldivica is here, and they're building a file system gateway specifically for Swift. So you can bridge between NFS and object in a very sane way. And there's a whole host of other archive and backup utilities, not to mention all the client libraries for Java, C#, Python. I mean, there's lots of client support for Swift.
Finally, commodity hardware. So Swift is designed from the ground up to handle failure. So you can get away with varying levels of quality. In fact, we were running an experiment, Ken will remember this, where we tried to get desktop drives to put in a fraction of the storage system. And the device manufacturer said, are those going in servers? And we said, yes. And they said, I'm sorry, we can't sell them to you. So we had to do a covert operation to even source a few hundred thousand dollars' worth of drives to put in a storage system. And it ended up that they worked fine, because Swift can route around any individual component failure, and that's perfectly okay.
And that meant, because it's based on commodity hardware, we can buy just what we need now. And then in subsequent generations, when the prices are cheaper and the capacity is more and the capabilities are more, we can grow it incrementally and just fold that capacity into the cluster.
All right, I'm going to go through the developer features, just because I think they're really cool. Static website hosting. So that means you can host static content, and it can be served out directly. I mean a webpage, CSS, JavaScript, the whole bit. You can do expiring objects, and that means you can set a date where, poof, it deletes, goes away. So that's great for data retention. You can have time-limited URLs. So you can basically say, I'm going to allow anybody in the world to upload into this container for a short period of time. So it's like a temporary holding place. Quota management. Direct-from-HTML form uploads. So again, you can build an app, mobile app or real app, and you can have the user post and you don't have to go through a web tier. You're uploading straight into the storage system. It's kind of neat. Versioned writes, which keep track of all the previous versions. Chunked transfer encoding. You can start an upload without knowing how big the file is going to be. And it will just chunk, chunk, chunk, chunk, chunk until you're done, and then that's your object and you can pull that object down. Multi-range reads, to kind of get in there. Access control lists. So you can build out an application and you can say this content can be shared with this group or this group, and that can be managed via ACLs. And if none of that works for you, then there's custom middleware that you can build on your own.
Here's how it all works, and we're going to talk about the architecture. So it's not a traditional file system. It's objects. And so that means it's not blocks, it's not a file system, it's REST and HTTP. And that's the access method. That's all you get.
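To make the time-limited URL feature mentioned above a little more concrete, here is a minimal sketch, assuming the HMAC-SHA1 signing scheme documented for Swift's TempURL middleware (method, expiry timestamp and object path joined by newlines, signed with a secret key set on the account). The account path, the key and the ten-minute window are made-up values for illustration, and the helper name `temp_url` is hypothetical, not part of any Swift client library.

```python
import hmac
import time
from hashlib import sha1


def temp_url(method, path, key, ttl_seconds, now=None):
    """Return `path` with a signature that is valid for `ttl_seconds`.

    Assumption: the cluster runs the TempURL middleware and `key` matches
    the temp URL key stored on the account.
    """
    now = int(time.time()) if now is None else int(now)
    expires = now + ttl_seconds
    # The signed message is the HTTP method, the expiry and the path.
    body = f"{method}\n{expires}\n{path}"
    sig = hmac.new(key.encode("utf-8"), body.encode("utf-8"), sha1).hexdigest()
    return f"{path}?temp_url_sig={sig}&temp_url_expires={expires}"


if __name__ == "__main__":
    # Hypothetical account, container and object names.
    print(temp_url("PUT", "/v1/AUTH_demo/uploads/photo.jpg", "not-a-real-key", 600))
```

Because the method is part of the signed message, a URL signed for PUT cannot be reused for GET, which is what makes the "temporary holding place" for uploads safe to hand out.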
The constructs in Swift are these. The very top layer is an account. Below that are containers, and below that are objects. You can have as many accounts in the system as you want. Each account can have as many containers as you want, which is a little different than Amazon S3, which limits the number of containers that you have. And then under that sits the object itself. And that's the construct. And that forms a URL over here. So the URL has an account, a container, and an object. So that's the construct, and you PUT to that URL to write the data (I said POST a moment ago, that was the wrong thing, thank you). You GET that URL to get that data out. And then you POST and DELETE, and you can do different operations to set different properties. So back to that feature list a few slides ago, you would POST to that URL to set different features on it.
So now behind that URL, there are two major components. There is the proxy tier. And that proxy tier is sort of the Grand Central Station. It's taking in all of the requests, and then it's routing those requests to the other component, which is the storage system. So if we're going to do an upload, the request comes in to the proxy, and then the write streams out to three of the storage locations. It's a quorum write. So you've got to get two out of the three. The majority of the storage servers have got to respond with, we got it, okay. And only then does the client get the response that the write succeeded. And then the storage servers, those can scale up or down depending on if you need more storage. The proxy servers can scale up or down independently, depending on how much throughput is needed out of the system. And then when a request is made for one of the stored items, what the proxy server does is it takes the GET request and then it asks the storage system, at the three locations where the object is supposed to be, for the object.
And the first one that returns is what handles the request.
So there's a data structure in here called the ring, and it's what makes all of this possible. It's a mapping from that URL, which contains the account, container and object, to three storage locations and then a series of handoff locations. And that data structure is distributed to every single node in the cluster. And it's just a data structure. I mean, there's nothing really magical about it. It's a Python pickle that's in memory. That's all it is. And so it's used everywhere in the system, for routing requests in the proxy server and then in the consistency processes. What the consistency processes do is make sure that the data is healthy and is replicated to the right locations. So this is one of them, the replicator. It will check hashes between collections of data, making sure that they're all synchronized, and if they're not, then there's a replication activity that'll occur. We'll get more into this when we do the install workshop. And then the other consistency process that runs is an auditor. It has a couple of ways of doing it, but one is a quick pass and the other one is a slow-moving pass, where it actually will pick up that data, recalculate the checksum and compare it against what's stored. So in that way, if it finds something that is inconsistent, it'll throw it into another bin called the quarantine area. And so those are the two ways that it keeps data consistent.
The next thing is that Swift can place data across different zones. This will be used in a data center or nearby data centers, where you want to make sure that you have some physical separation or network separation between different sets of data.
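As a rough illustration of the ring lookup described above, here is a toy sketch. It assumes the general shape of the idea (hash the account/container/object path, keep the top bits as a partition number, then look that partition up in a per-replica table of devices). The partition power, the device names and the round-robin table are invented for the example, and handoff locations are left out entirely. A real ring is built by Swift's own tooling and balances far more carefully.

```python
import hashlib
import struct

PART_POWER = 4   # toy value: 2**4 = 16 partitions
REPLICAS = 3
# Hypothetical devices; the "zN" prefix stands for the zone each one lives in.
DEVICES = ["z1-d0", "z2-d1", "z3-d2", "z1-d3", "z2-d4", "z3-d5"]


def build_table():
    """One row per replica; row[partition] is an index into DEVICES."""
    parts = 2 ** PART_POWER
    return [[(p + r) % len(DEVICES) for p in range(parts)]
            for r in range(REPLICAS)]


def get_partition(account, container, obj):
    """Hash the object path and keep the top PART_POWER bits."""
    path = f"/{account}/{container}/{obj}".encode("utf-8")
    top32 = struct.unpack_from(">I", hashlib.md5(path).digest())[0]
    return top32 >> (32 - PART_POWER)


def get_nodes(table, account, container, obj):
    """Return the partition and the three devices that should hold it."""
    part = get_partition(account, container, obj)
    return part, [DEVICES[row[part]] for row in table]


if __name__ == "__main__":
    table = build_table()
    print(get_nodes(table, "AUTH_demo", "photos", "cat.jpg"))
```

The point of the sketch is that the lookup is pure computation on a small table, so every proxy, replicator and auditor can hold its own copy and agree on where an object belongs without asking a central service.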
And then new in the Grizzly release are components to begin specifying regions. Regions are kind of like zones, but there are different rules in how data is replicated between the different regions. So yeah, go ahead. Oh sorry, the question is, how is the data replicated across regions? And so it's going to depend on how you set it up. There are a couple of use cases that we're addressing with multiple regions. The first one is a disaster recovery zone. So in that case, you'd have a primary location, where you may have a full or slightly reduced replica count, and then an offsite replica, say of one or two copies of the data, and it's not really meant to be accessed. And then the other scenario, and we see this more with people building applications, is where they want to have two regions and send users to either one. And so in that way, there are going to be users accessing both systems at the same time, and then asynchronously, at each site, the replicas get updated. And so that's the design philosophy behind the multiple region support.
So the essence of all this is how we're building out software-defined data centers and software-defined storage, and my perspective on how to piece these things together. When we're constructing our systems here, the way I'm thinking about it is that there are four layers. There's a layer to handle the routing and servicing of requests. So that means authentication, load balancing all the requests, servicing the data itself. Then the next layer is, I don't know what label to put on it, but the storage intelligence, which is what Swift is. So it's how does a node know how to behave? And then there's the physical hardware itself, what actually can execute those behaviors. And then the controller, which is there because we're running distributed systems now, and each node really only knows what it knows.
And it doesn't know about the rest of the system as a cohesive entity. So there needs to be something that has control and has some overall cluster awareness of what's happening in the distributed system. And so that's the control component. And a controller in this context, running an object storage system, what it can do is tune and optimize the cluster while it's running. And it can notice when the system is routing around failures. It can see that. It can say, oh, there's been a failure here. There's been a failure here. So in that way we can inform an operator, and they can start taking corrective action or dealing with it, replacing a node, replacing a drive, whatever the action requires. And so this is one actor and participant in a software-defined data center, where you can orchestrate capacity, storage, networking, routing, and all these services. So it's just one piece. And in addition to the work that we're doing on Swift, this is the other area that we're digging into. So I'd encourage you to check this out as well.
So thank you very much. And there are more copies of the book at our booth if you didn't get a chance to get one. And Anders has got some. So see him over in the corner. But I'm happy to take any questions that folks may have.
Correct. Yep, commodity. Oh, yeah. Yeah, so the question is, what are the performance characteristics of these nodes? Just because they're commodity doesn't mean that they're low-end, by any stretch of the imagination. We at SwiftStack actually care a tremendous amount about what's inside of the box. So that is what controller cards, what drives, and if there's a performance requirement for it, we certainly will throttle that up. So thank you. I'm going to take some more questions down in the front of the stage, but I appreciate it. Thanks.