if you want to play along on your laptops. All right, let's begin. So this is Swift 102: extensibility, and what you can do with it in OpenStack Swift. I'm Doug Soltas. I work for a company called SwiftStack, where we're the leading contributor to OpenStack Swift. And today I'd like to tell you a little bit more about some of the more advanced functions that Swift has beyond just create, read, update, and delete.

So the first question is, are you in the right session? If you're here, I do hope that you already know what an object is, and I hope you know that objects can be accessed through a REST API. If you don't know those, we will cover them, but we're going to move through them quickly and on to what middleware is. We're going to outline the functions that exist in middleware, and we're going to talk about example applications and use cases for some of the middleware that exists today, before we let you go out and, hopefully, build your own middleware.

So again, quick recap, Swift 101. What is an object? An object's really simple. It's a file that we're going to upload via the HTTP protocol. You can access it via a REST API. And the unique thing is, we're going to associate some metadata with it. In this example of the Golden Gate Bridge, here's some metadata that you may want to associate with this image.

How are objects accessed? Well, what most people do when they use a Swift object store is create an object, so they PUT it, then they GET it many times, and eventually they probably DELETE that object. These are your basic CRUD commands, as they're called. You can do the same thing for the metadata associated with the object. And today what we want to do, just like a talk earlier this week did, is move beyond the basic commands that CRUD gives you. If you're still fuzzy on Swift 101, I highly recommend that you stop by the SwiftStack booth.
They are giving out the O'Reilly OpenStack Swift animal book, and the OpenStack website also has a lot of great resources and tools for Swift. The Swift 101 presentation has been given here at the Vancouver OpenStack Summit, and video is already available online from the Atlanta and Paris Summits.

So let's start off with: what is middleware? Well, Swift has this thing called a pipeline. A pipeline is a Python construct, and it allows you to intercept requests or responses and, if needed, make alterations to them. So if you think about Super Mario here, he's about to go into the pipeline, and at each of these orange points, something could happen. Maybe he'll be routed to the right or the left or up. Maybe you have to do something on your controller. In the Swift pipeline, we're going to let middleware look at what you've sent and possibly alter, if you decide you want to, what is sent to the Swift backend system. And that can happen on ingest or on egress.

Without middleware, this is what most people are doing today, even though middleware ships right in OpenStack Swift when you download it. You have a client, they access the proxy, they probably PUT some data, they request some data, they delete some data. It's kind of boring. It goes straight to the storage node. With middleware, we can inject anything we want at both the proxy and the storage node. We can alter your data, we can concatenate your data, we can convert it, if it's a genomic sequence, from a compressed file to an uncompressed file. We can index it. There are a lot of neat things we can do with middleware.

So what middleware exists today? Well, here's just a small subset of what ships with OpenStack Swift. It's already baked in. And of course, you can write your own middleware, and we'll get to that. You can really build it to do anything you want.
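As a rough illustration of the pipeline idea described above: Swift middleware is written as WSGI middleware, so a filter wraps the next stage, sees the request on the way in, and sees the response on the way out. The sketch below is not code from the talk or from Swift itself; the class name, the `X-Demo-Tag` header, and the stand-in backend are all hypothetical, chosen only to show the two interception points.

```python
class HeaderTaggingMiddleware:
    """Hypothetical filter: touches the request inbound and the response outbound."""

    def __init__(self, app, tag="demo"):
        self.app = app      # the next stage in the pipeline
        self.tag = tag

    def __call__(self, environ, start_response):
        # Inbound: inspect or alter the request before the next stage sees it.
        environ["HTTP_X_DEMO_TAG"] = self.tag

        def tagging_start_response(status, headers, exc_info=None):
            # Outbound: inspect or alter the response headers on the way back.
            headers = list(headers) + [("X-Demo-Tag", self.tag)]
            return start_response(status, headers, exc_info)

        return self.app(environ, tagging_start_response)


def backend_app(environ, start_response):
    """Stand-in for the rest of the pipeline (not real Swift code)."""
    body = ("saw tag: " + environ.get("HTTP_X_DEMO_TAG", "none")).encode()
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body]


def call_app(app, path="/"):
    """Drive a WSGI app once without a server; returns (status, headers, body)."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = dict(headers)

    environ = {"REQUEST_METHOD": "GET", "PATH_INFO": path}
    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body
```

Calling `call_app(HeaderTaggingMiddleware(backend_app))` exercises both directions: the backend sees the tag added to the request, and the caller sees the header added to the response. Chaining is just nesting one wrapper inside another.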
And if you want to know even more about middleware, I would recommend that you go watch the Paris Summit talk, Swift 102: Extensibility in OpenStack, by John Dickinson. He's the Swift Project Technical Lead, and he really went through all the different ones that exist today and talked about how you can get involved. But I want to get down to a different business. I don't want to tell you what it does, I want to tell you how to use it. So I've highlighted a couple of key middleware features that I'm asked about on a daily basis, whether by end users, app owners, or project developers. We're going to go through a couple of examples today: how to check the proxy status, how to authenticate, how to do range reads, bulk uploads, a lot of really nice functions and features. And again, there are some associated code examples that you can follow along with if you download the deck.

So the first thing I want to talk about is a Swift middleware implementation called healthcheck. What healthcheck does is really simple. It lets you know if a proxy node is up or down. Normally you're going to put a load balancer in front of your Swift cluster, and on your load balancer, whether that's an F5 or a NetScaler, you're going to write some complicated script or logic to do something like try to PUT every 30 seconds, try to GET every 30 seconds. And based on the response you get, you'll make an evaluation as to whether the system is up or down. But with healthcheck, it's nice and easy. We're going to go to a single address, and if the node's working properly, it's going to respond OK. And if not, you're not going to get that response. That's going to really help when we start talking about rolling upgrades in Swift, because Swift is this huge distributed platform that should have no downtime and should be able to route around failure.
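A minimal sketch of the check described above, roughly what a load balancer probe would do, assuming the proxy exposes the health check at `/healthcheck` and answers with a body of `OK` when healthy. The function name and the injectable `fetch` parameter are illustrative choices, not part of Swift; `fetch` defaults to the standard library's `urlopen` and exists so the logic can be exercised without a live cluster.

```python
from urllib.request import urlopen


def proxy_is_healthy(base_url, fetch=urlopen, timeout=2.0):
    """Return True only if <base_url>/healthcheck answers 200 with body 'OK'."""
    url = base_url.rstrip("/") + "/healthcheck"
    try:
        with fetch(url, timeout=timeout) as response:
            return response.status == 200 and response.read().strip() == b"OK"
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses
        # (urllib's URLError and HTTPError are both OSError subclasses).
        return False
```

Anything other than a clean `OK` is treated as "take this node out of rotation", which matches how the talk describes using the result.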
The health check covers two cases. A, a node fails and does not give you that OK response, meaning that one of its services is not working right. Or B, you actually intend to do an update: the node stops accepting new requests and changes its status from OK to not OK, to notify the load balancer that it's not ready to take requests. This is probably the simplest way that you can use this piece of middleware. If you go to /healthcheck on any Swift proxy node, you should get a return of OK. You can try that right now by opening up your web browser, over HTTP or HTTPS, and hitting a Swift node, and you should get the very exciting OK message back.

But there's more we can do. That's probably the easiest example. One of the questions I get asked all the time is: well, Swift has all these great features. In fact, if you went to an earlier talk by John Dickinson, Swift Beyond CRUD, he talked about TempURLs and FormPost and all these neat things that a Swift cluster can do. But that's set up by your Swift admin. So if you're a developer or an end user, how do you know what your admin has enabled? How do you know the maximum size of an object you can put out there? We've talked a lot at past Swift events about storage policies, which were new last year. How do you know what the names of those storage policies are? I get this question all the time.

So there's another piece of middleware, again one that ships by default in OpenStack Swift, and it's called info. Here's just a subset of the things you can get from it. If you go to your cluster and you go to /info, you're going to get a JSON response back that looks like this. Now, I know this is a bit of an eye chart, so I've highlighted the two examples that I probably get asked about the most.
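To make the questions above concrete (what is enabled, what is the maximum object size, what are the policy names), here is a small sketch that pulls those answers out of an `/info`-style document. The sample JSON is hypothetical: the policy names and values are invented for illustration, and a real cluster's response will contain many more keys. The helper name `summarize_capabilities` is likewise an assumption, not a Swift API.

```python
import json

# Hypothetical, trimmed-down /info response; real clusters return far more.
SAMPLE_INFO = json.loads("""
{
  "swift": {
    "max_file_size": 5368709122,
    "policies": [
      {"name": "replica", "default": true},
      {"name": "ec"}
    ]
  },
  "tempurl": {"methods": ["GET", "PUT"]},
  "formpost": {}
}
""")


def summarize_capabilities(info):
    """Answer the common questions from a parsed /info dictionary of dictionaries."""
    swift = info.get("swift", {})
    policies = swift.get("policies", [])
    default = next((p["name"] for p in policies if p.get("default")), None)
    return {
        "max_object_bytes": swift.get("max_file_size"),
        "policy_names": [p["name"] for p in policies],
        "default_policy": default,
        # Every top-level key other than "swift" names an enabled feature.
        "features": sorted(key for key in info if key != "swift"),
    }
```

Against a live cluster, the same function would be fed the parsed body of a GET to `/info` on the proxy instead of `SAMPLE_INFO`.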
So if you go to your Swift cluster and, again, you go to /info, you're going to get a JSON response. Does that green show up well? It's actually a dictionary of dictionaries, and inside of it, if you wanted to know what your storage policies are, here it is: policies. There are two of them. There's one called standard-replica on this cluster. There's one called EC on this cluster, and this is the default.

Later on in my talk, we're going to talk about static large objects. One thing that's very helpful if you're using static large objects is to know the maximum number of segments you can put into a static large object, or the minimum size of each segment. So again, without picking up the phone and actually calling your Swift storage admin, or if you're using a cloud provider such as IBM SoftLayer, you can easily ascertain this information.

Authentication is something that's really important to any storage platform. And what a lot of people forget is that this is part of middleware as well. By default, Swift ships with TempAuth. TempAuth is the version 1 authentication, and it's really insecure. It stores all of your passwords in an unencrypted file on the Swift cluster. However, if you have Keystone, you can use v2 authentication, and again, you authenticate through middleware. Well, what if you wanted to use LDAP or Active Directory? What if you wanted to use something local to the Swift cluster, like TempAuth, that was secure? A number of firms out there have written additional middleware that you can license or plug in, or you could write your own to enable these features. In fact, on Monday at the keynote, there was a whole presentation on federated Keystone. So wouldn't it be great if the next piece of middleware written for authentication was for federated Keystone?
So if you want to authenticate, and you're going to need this information if you follow along at home with some of my later slides, it's pretty simple. You send a username and a password to your Swift cluster, to an authentication URL, and you get a response, and that response has two things in it: the token, which is good for a certain amount of time to access the data, and your Swift storage URL. When you see my later examples, I reference the Swift storage URL as $SURL and the token as $TOKEN.

So let's get into some more advanced features. Those were really, really basic ones, though extremely helpful. Swift has the ability to do versioning through middleware. Why would you want to do versioning through middleware? Well, let's say that you have a container and you're using a tool like Cyberduck, ExpanDrive, or CloudBerry, and you're consuming Swift natively from your laptop. For me, I'm writing this presentation, Swift 102. So I have a container. In Swift we call them containers, and that's where we store our data, instead of a directory. I'm putting my document in there, and every time I overwrite it, the old copy ends up in another container called presentations_old. There's nothing special about these names. I could have called them whatever I wanted. It's going to save each version of the document. And if I were to delete the newest version from presentations, presentations_old would take the most recent version and return it back.

So if you work with something like Storage Made Easy or another file sync and share system, and your users want versioning, there's no reason for you to write your own versioning. It's already baked into OpenStack Swift. And if you want to enable it, it's actually pretty simple.
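Going back to the authentication exchange described earlier in this passage (credentials in, token and storage URL out), the sketch below shows the header handling for version 1 style authentication as TempAuth does it: credentials go in `X-Auth-User` and `X-Auth-Key`, and the reply carries `X-Storage-Url` and `X-Auth-Token`. The function names, the sample account, and the sample token are assumptions for illustration; no network call is made here.

```python
def build_auth_headers(user, key):
    """Request headers for a v1-style auth call (for example GET /auth/v1.0)."""
    return {"X-Auth-User": user, "X-Auth-Key": key}


def parse_auth_response(headers):
    """Extract (storage_url, token) from the auth response headers.

    Header names are matched case-insensitively, as HTTP requires.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    try:
        return lowered["x-storage-url"], lowered["x-auth-token"]
    except KeyError as missing:
        raise ValueError(f"auth response lacks header {missing}") from None


def token_headers(token, extra=None):
    """Headers to send with every later request against the storage URL."""
    headers = {"X-Auth-Token": token}
    headers.update(extra or {})
    return headers
```

The two values returned by `parse_auth_response` play the roles the talk gives to the storage URL and token variables: every later request goes to a path under the storage URL and carries the token header.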
There is one line that you need to set in your server configuration file. Then, in this example, we build two containers, and I POST to one of the containers with a header, so this is my metadata: X-Versions-Location, set to my files versions. Essentially I'm telling my files that whenever an object is replaced, it should take the older version and put it into this container. Here's a further example of actually using this code, again, if you want to try it against one of your own systems.

Another extremely useful middleware feature that you can enable is called object expiration. What this does is: I put an object into my container, and I set one of two header flags on it. We'll start with the easier one, X-Delete-After, and in this example that's in seconds, so 45 seconds. If I put a document into this container, 45 seconds later it will self-destruct. It will delete. Now, does it really delete? Well, no. But remember, middleware can act on the request or on the response. If you request this object and it has not yet been deleted by the auditor, middleware will check to see that it should be deleted and will return you a 404.

Likewise, there's another extremely useful header that you can use with object expiration, and that's X-Delete-At, which takes epoch time. So if you wanted every single object in a container to be deleted on January 1st, 2017, you could set the epoch time of January 1st, 2017 on them, and all the objects in that container would become unavailable.

So where is this useful? Well, this is really useful when you're using Swift as scratch space. Imagine you're a genomics lab and you have a sequencer, and it's sequencing the DNA of everybody in this room and putting it all into a container. From there, our HPC unit is going to read that data, manipulate it, align it, and produce a final outcome.
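The two expiration headers described above can be sketched as follows: one helper builds `X-Delete-After` from a number of seconds, one builds `X-Delete-At` from a point in time expressed as epoch seconds, and a third models the read-side check (an object past its expiry is answered with a 404 even if it is still on disk). The helper names and `status_for` are illustrative only; they are a simplified model of the behavior the talk describes, not Swift's implementation.

```python
from datetime import datetime, timezone


def delete_after_header(seconds):
    """Relative expiry: the object expires this many seconds after the request."""
    return {"X-Delete-After": str(int(seconds))}


def delete_at_header(when):
    """Absolute expiry: epoch seconds for a timezone-aware datetime."""
    if when.tzinfo is None:
        raise ValueError("use a timezone-aware datetime to avoid local-time surprises")
    return {"X-Delete-At": str(int(when.timestamp()))}


def status_for(delete_at, now):
    """Simplified model of the read path: expired objects look like 404s."""
    if delete_at is not None and now >= delete_at:
        return 404
    return 200
```

For the talk's example date, `delete_at_header(datetime(2017, 1, 1, tzinfo=timezone.utc))` gives an `X-Delete-At` value of `1483228800`. Either header can be sent on the PUT that creates the object or on a later POST that updates its metadata.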
Well, all the data that we had in this scratch container is not really very relevant once I actually have the end data. So instead of writing another program that says, you know what, after 30 days I'm going to go back and delete this whole room's worth of people's scratch data, which can be very time consuming, why don't we just let the Swift cluster take care of it for us? That way you're not wasting time and compute on a client node, or worrying about really anything. Let the system take care of it for you.

So here's a quick little example. All you need to do is set the header when you PUT or POST the object. If you PUT an object, you can put the header on it, or if you have an existing object, you can add this header to its metadata. It's X-Delete-After: 45, and that's going to delete it in 45 seconds. You'll see that what Swift does, actually what middleware does, is convert it to X-Delete-At, which is your epoch time, and after that time you won't be able to get your file back. Now, in this example I play along with my previous slide on versioning. I actually wrote three different versions of a file to one container, and then I tagged one of them with delete after 45 seconds. And believe it or not, you can chain middleware together, as we talked about with that pipeline, and so I actually get the previous version back. They can trigger off of each other, just like Super Mario going through that pipeline.

Another great feature that people ask for all the time, and that has been available in Swift forever, is a range read. When you put an object into the system, you put the whole object in, and if you do a GET, you get the entire object back. Well, that's not always useful. What if I'm a media company? What if I'm DigitalFilm Tree? They were up on stage, and he very much said, hey, we're taking a sparrow, we're writing this to our Swift back end. They use Swift, a ton of Swift, at DigitalFilm Tree. And now we're going to edit that video.
I don't want to pull that entire file back. I just want a small subset of that data. So in this example, you could take War and Peace, upload the entire thing to the Swift cluster, and then send a range and say, I just want the first hundred bytes, byte zero to a hundred. And this is what I actually downloaded: the first hundred bytes of War and Peace. Likewise, you can give it multiple ranges. You could say, I want byte range zero to a hundred, comma, I want a million to a billion, and it will send you a MIME multipart response back with both of those byte ranges. You can also say, I want minus a million, and it'll send you just the last megabyte of the file.

This is also really useful when we're talking about backup data. If you backed up your entire Mac hard drive, and your Mac is a hundred gigs, and you only want one file back, a range read is a much more efficient way to get it than pulling back that entire object. Again, we have a little bit of code here to show you how it can be done, including a multiple range read.
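The range forms described above map directly onto the standard HTTP `Range` header. The sketch below builds that header for single, multiple, and suffix ranges, and includes a tiny local emulation of how one span selects bytes, so the inclusive end offset is visible. Both function names are illustrative; `apply_span` is a model of the semantics, not server code.

```python
def range_header(*spans):
    """Build a Range header from (start, end) pairs.

    Use None for an open side: (500, None) means "from byte 500 to the end",
    and (None, 1000000) means "the last 1,000,000 bytes".
    """
    parts = []
    for start, end in spans:
        left = "" if start is None else str(start)
        right = "" if end is None else str(end)
        parts.append(f"{left}-{right}")
    return {"Range": "bytes=" + ",".join(parts)}


def apply_span(data, start, end):
    """Emulate one byte range against local bytes (end offset is inclusive)."""
    if start is None:
        # Suffix form: the final `end` bytes of the object.
        return data[-end:] if end else b""
    if end is None:
        return data[start:]
    return data[start:end + 1]
```

Because the end offset is inclusive, a request for `bytes=0-100` returns 101 bytes, and the first hundred bytes exactly would be `bytes=0-99`. A request with more than one span comes back as a multipart body, one part per range.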
Server-side copy. This is another great feature. So I've got my Swift cluster, it's up in the cloud, I'm using, let's say, IBM SoftLayer or another great provider out there, and I've sent my data over to my container. Then I decide I want to do some modifications to it. I want to copy it over to another container. Why do I want to copy it to another container? Well, Swift just came out with this thing called erasure coding. So maybe the previous container I put it in was set for replicas and the new container is set for erasure coding, and maybe after a certain amount of time I want to move my documents from being replicated to being erasure coded. Well, the only way to do that is to put them into the new container. So you could request the object, pull it back, waste that bandwidth, and send it back over the line. Or you could just tell Swift, hey, do a server-side copy: copy War and Peace from the green container and put it in the red container. And again, each of those containers could have totally different characteristics.

Or, as we know, an object should be immutable. When I put an object up on the system, I don't want to be changing that object. If I change the object, it's now technically a new version, and if I don't have versioning enabled, then I'm not going to be keeping my old versions. So what if you wanted to rename War and Peace dot doc to War and Peace dot old? Well, you could give it a server-side copy, and again, without wasting any bandwidth, you instantly have a new copy, and then you could just send your delete command.

Now, server-side copy is another one of these things that we can combine in middleware. In my example, and I tried to give you another fun one, I've combined it with a range request. There's nothing wrong with saying, I don't want to copy the entire War and Peace, I just want to take the first chapter and copy it to a new object. Again, middleware can be chained together. This is a really good function, again, to save you bandwidth and to save you
time.

So, when you upload a lot of small objects, you're opening and closing, opening and closing HTTP channels. What if I could tar a bunch of files together and send them in one batch, and then let the proxy layer break them apart and save them as individual objects on the storage layer? Because at the end of the day, that is what I wanted on the storage layer. And if these are docs, like this presentation I've got right here, maybe they zip up really well. So I can use gzip or bzip2, I can tar these together and zip them, and I can do what's called a bulk upload in Swift to my container. My container happens to be called presentations, and Swift will take care of unzipping everything and storing the files as their native documents.

I was talking to a developer last night at dinner, and he said, hey, this is even new in Swift, and I don't actually have the example for this because this was at dinner last night. He said, if you zip these up and you tell tar to take the extended attributes, and those extended attributes for these files happen to have metadata, then these objects will also have that metadata through this process. So again, here is an example of how you can do this, if you leave here and want to try it at home.

All right, so let's get into some of the ones that people use the most: large objects. When you go to a Swift 101 talk, they tell you the biggest object you can have is the size of the entire Swift cluster. I've got a petabyte Swift cluster, I could put a petabyte file in there. But it's a little deceptive. They should say you can put a petabyte file in there. They shouldn't say you can put a petabyte object in there, because a file can be a collection of many objects chained together. And why would you want to do this?
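Returning to the bulk upload described above, here is a minimal sketch of the client side: several small files are packed into one gzip-compressed tar archive in memory, and the archive is addressed to the container with the `extract-archive=tar.gz` query parameter so that the proxy unpacks it into individual objects. The helper names, file names, and storage URL are hypothetical, and the actual PUT (with the auth token header) is left out so the example stays self-contained.

```python
import io
import tarfile


def build_archive(files):
    """Pack {name: bytes} into a gzip-compressed tar archive held in memory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, payload in files.items():
            member = tarfile.TarInfo(name=name)
            member.size = len(payload)
            archive.addfile(member, io.BytesIO(payload))
    return buffer.getvalue()


def bulk_upload_url(storage_url, container):
    """Target for a single PUT whose body is the archive built above."""
    return f"{storage_url.rstrip('/')}/{container}?extract-archive=tar.gz"


def list_members(archive_bytes):
    """Names inside the archive, i.e. the objects expected after extraction."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as archive:
        return sorted(member.name for member in archive.getmembers())
```

One PUT of the archive bytes to `bulk_upload_url(storage_url, "presentations")` replaces one HTTP request per file, which is where the saving on many small objects comes from.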
Well, for one, by default Swift is set to a five gig limit per object, so you can't put an object bigger than five gigs into Swift. And if you want to know what your current cluster is set at, remember, use middleware: go back to /info and you can get that information.

But why else would you not want to put a five gig object, or a bunch of five gig objects, into your Swift cluster? Well, it doesn't give you very good distribution of your data. If you've got a hundred disks and you're putting small objects on them and you're putting big objects on them, you may see imbalances, because Swift is going to place everything as uniquely as possible, but not based on the actual size of the object. So you would get better distribution by breaking that five gig object into maybe a hundred individual segments, and you'll get better throughput sending those up, too. But now that I have that one five gig object broken into a hundred little segments, how do I stitch them all back together? That's what we're going to talk about.

The first way is through something called static large objects. A static large object is kind of special. It supports range reads. So again, if I've got that giant movie file and I upload it, and the use case is that I do want to seek within it, or it's a terabyte backup and I do want to seek within it, it's really important that I create a static large object and not what we'll talk about next, which is a dynamic large object. Static large objects can use objects across multiple containers. I create this instruction, what we call the manifest. If you think about this like Legos, if these are all my objects, I'm going to create a single file that tells you how to reassemble them into my static large object, and by doing that I'm still able to support range reads. Now, by default your minimum segment size, the size each of these objects has to be, is a megabyte. But again, if you need to know what your cluster is set at, go to /info.

the other great thing
about static large objects is that when you send one up, it has a list of the ETags and the sizes of every object that is expected to reassemble into the static large object, and if one of those goes corrupt or missing, then we can report that the entire static large object is corrupt.

And lastly, and this is probably the best benefit of all: deduplication. Let's go back to our movie example. I just uploaded Star Wars Episode VII, and there's a scene of Han Solo in it, but Harrison Ford says, no, my hair looks too gray, you know, change it. So I've got my movie up there, I've uploaded the whole thing. I do my range read, I get the segment where Harrison Ford is a little too gray, I do my movie magic, and I change it. Now I upload that segment again. But I want to keep the original, because we're Hollywood, we never want to alter the original. So instead of uploading the whole movie, I simply put that one object back up there, and I create a new SLO, a new manifest, a new instruction, and I reference all the previous objects that I put up there except for the one where Harrison Ford's hair is a little bit too gray. When you think about it that way, it gives you deduplication. You could use this in backups as well. If you're deduplicating your backup, your file system, into 4K chunks, build a static large object, and then when that changes, build a new static large object, uploading only the objects that changed.

Here's a little bit of code. This is what a manifest looks like. Again, you reference the container and the object, and you have to give it the ETag and the size in bytes, so that we can ensure that you're getting the right object and can give you that range read ability.

Let's talk about dynamic large objects. Now, I see a lot of people using dynamic large objects when they should use static large objects, and the reason is simple: it's pretty easy to use a dynamic large object. A dynamic large object can only be within a single container and
what it does is it sets a prefix. So imagine I have a logging machine. My Swift cluster is logging every single access, and I've got middleware that tells it to send this all out to a file, swift.log.text. But by time of day, every hour, I'm rotating this log, so my log system is saving these files, swift.log.text01, 02, 03, 04, into this container. I can create a DLO, and all I have to do is give it swift.log.text. That is my prefix, and anything continuing on from that is taken in alphabetical order. If I request this object, I get the whole concatenation of all of the pieces together.

This is also really important if you have a video system. Maybe you bought one of those cameras online and it's monitoring your house, but you tell it to only record when it sees motion, and you tell it to save each file with the time of day. Then you set a prefix for the actual day, and now you can watch all the motion from the day without having to open each individual object one by one.

Here's, again, how you would create a dynamic large object. In this case, any object in the container whose name starts with SEG would appear as part of a single object. And again, when you retrieve it, you retrieve it like you would any other object. Middleware gets in there, sees what you're trying to do, and returns what the user wants without them knowing about any of this.

So, I don't like it when a talk ends and they say, well, now that you know this, go out there and build something. Well, build what? I've given you use cases and examples of what you can build, but when I meet with groups like you that say, I want to use middleware, I hear new use cases every day. So these are just a few that I've heard this week at this event. Yesterday there was a talk by HudsonAlpha about genomics, and he said in that talk, I would really like middleware that takes a BAM file, which is a compressed
file of your aligned genomics data, and converts it on the fly to an unaligned, uncompressed file, because that's sometimes more useful to me. I don't want to store both in Swift. I only want to store the BAM file, but when I request it, I want to give a header that uncompresses it.

Likewise, you can do a bulk delete. I didn't spend time on it, but that's another piece of middleware. If you want to do a bulk delete on something like everything that starts with SEG star, you can't do that in one step. You have to do two commands. First you query the system and ask for a list of everything that matches SEG star, and then you return that list to the Swift system in a bulk delete command. Well, what if I wanted to do that in just one command? It'd be easy. We could string those two together.

Or I heard another one, again from the genomics folks. What if every time I put an object with metadata into the Swift cluster, I take that metadata and send it over to another system? Maybe it's Elasticsearch, maybe that metadata goes to a MySQL database, because that's what people want to query. So the sky is the limit, if you can think of what you want to build. Or just talk to other people and they'll tell you what to build.

And I forgot one thing. I forgot to tell you how to build middleware. I told you about middleware that exists today and I told you how to use it, but there's a video for that too. Back in Atlanta, Christian from eNovance did a Swift talk on how to build middleware, and during the talk he sat right down and started coding middleware. He has a full slide deck, and it's all available online if you choose to go out and develop your own middleware. But otherwise, I hope that I've enabled you to take advantage of some of the neat functions that currently exist in middleware beyond create, retrieve, update, and delete.

So if you're inspired to do more and you want to try this on your laptop right now, you can go out and get Swift All In One. It's a little developer VM that you can set up on your
laptop. You can go to the Swift OpenStack website and you could set up an entire Swift cluster. Or the company that I work for, SwiftStack, actually makes it really simple. They put a whole management piece on it. You throw some commodity hardware at the Swift controller and it sets the whole thing up as a nice, easy Swift cluster, so you can just get down to the business of actually using Swift: building applications, or enabling your users with a backup program or a genomics project.

So that's pretty much it. If there are any questions, I'd be glad to take them now. And again, this slide deck is available. If you go to @DougSoltes on Twitter, I tweeted it out this morning. If you guys want, I can rewind to the very first slide, where I had a shortened link that you can pull it from, and it will be available after the event. Any questions?

Question on middleware: what limitations do you have? I mean, are there examples of the kinds of things you can't or wouldn't want to build as middleware? One example that I can think of, that I'm not sure you can do, is for instance inotify, the equivalent of inotify for Swift.

So middleware is going to run in Python, for right now. There used to be ZeroVM, which was a way for middleware to call out to another function. Now, at the IBM talk in Paris, IBM gave a presentation on something they called storlets, and that's another type of middleware that you can enable for OpenStack Swift. Some bad examples would be something that's really processor intensive, because you're going to be doing this computation in middleware on your Swift node. So let's say I have that video file and the thing I need to do is transcode it. That would probably be a bad idea, whereas unzipping a genomics file wouldn't be such a bad idea. If I wanted to do the transcode, I would probably use something like the IBM storlet middleware to ship that off to a separate compute node, or something like an HPC cluster, and bring the result back. So what you want to do is be responsible and not take your proxy and
hammer it at 100% CPU, because if you do, the middleware that you write can either break the pipeline or make the entire environment unresponsive for other people. Does that help?

Sure. So let me just repeat the question. The question was: with object expiration, I mentioned that it doesn't actually delete the object. What it does is give you a 404, and you're unable to get the data back. So when does it get deleted? Well, Swift has this thing called the auditor, and it's constantly walking the system. What it does is open an object, check the hash of it, compare that against the ETag, and make sure that the object hasn't been corrupted. If it has, it puts it in quarantine, and then through either erasure coding or replicas we repair it. So with object expiration, it's middleware that makes the object unavailable to you, but when the auditor hits it, the middleware will kick in and tell the auditor, please delete that object. Depending on your cluster and how long it takes to walk, that could affect how long it takes to actually get the space back in the Swift cluster.

Question on versioning: when you're creating a version copy, is it just doing a diff copy or is it doing a full copy?
It's doing a full copy. So if we go to the actual code example there. Yeah, it would be nice or neat if somebody wrote some middleware that did a diff copy, like Subversion or something else would. With object expiration and versioning, these are what your versions will actually look like, and they are the full previous object. So essentially what versioning is doing is an automated server-side copy, right? We talked about server-side copy.

Now, something else you could do with versioning is a trash can. Think about an undelete trash can. You have a container, and you do want versioning, but you also want it so that every time somebody deletes an object, instead of being deleted, it still goes to the version folder or container, and nobody has access to that except for the admin. And maybe you set everything in the version folder to automatically expire after a hundred and eighty days. So now you've protected yourself a little bit against that rogue user who might get angry at the company and try to delete an entire Swift container. You can pull everything back. So again, you can string things together. But to the point: this does not de-dupe, this does not diff the file.

Any other questions? I'm told we have five minutes left. I'll go back to the first slide if anybody wants the link to download.

Just a clarification question. Oh, I'm sorry. You're using the term containers. Is that a different term than the other containers, like a Docker container? Yes. Could you clarify that?

So if you deal with Amazon and S3, they use something called a bucket. Swift uses something called a container. Think of a container like a directory with a database associated with it. It has nothing to do with Docker containers, which are a different space, for actually running your code isolated at a kernel level. What this is: Swift implements a SQLite database per container, which is like a directory, and that's where we're holding the metadata. So when you write your object you
get quorum, and if that's erasure coding, that's all of the data pieces plus one parity, or if it's replicas, it's half the replicas plus one. So you get a return that you're OK. It then goes and updates a database, a MySQL, I'm sorry, a SQLite database, so that when you want to list a directory, or a container, it's responsive. It can sort, it can do all the things that a database is very good at. So in the context that I'm speaking, a container is a Swift term for basically a database-enabled metadata directory. Anyone else?

Could you put together something that would synchronize Swift objects to another file system type, HDFS for example?

Yeah, so another great use case that I've heard. The question was: is middleware another enabling feature to sync things between different file systems? File systems have things like extended attributes. Or if you're working in a file system, let's say some sort of Linux file system, you've got user IDs, and those user IDs aren't federated between different groups. Well, what if, when you ingested an object, you tagged the metadata with the user ID? Let's say I'm at a certain university, University of Madison, Wisconsin, and I tag my user ID. I also create a database out there of all the user IDs, a federated database of user IDs from all the different schools. And then you use middleware to look up and translate when you're on, say, Internet2, and you have all these different HPC clusters working on different people's data, so that all the permissions and everything flow properly. So, a very good question.

All right, well, I thank everyone for their time, and I hope you learned something.