All right, everybody, let's get started. I just want you to know you don't have to take pictures of these slides, because they're all available on the internet right now for download, and of course this session is being recorded, so you can watch it later.

I'm Doug Soltesz. I work at SwiftStack. This is Clay Gerrard; he's one of the core Swift developers. And today we're going to advance you further into the Swift realm: this is Swift 102, Beyond CRUD.

We've got a quick agenda. Normally, if you've seen our previous talks, we do a bunch about the Swift API, and there's certainly going to be Swift API in here, but we decided we'd throw in a little bit for operators. There's a lot of new stuff in the Mitaka release, Swift 2.7, and Clay is going to be talking to that. And in case you have not seen us talk before, and all this stuff is too new or too complex or whatever, well, guess what: all the OpenStack videos are available online. Clay and I have spoken before in Tokyo, we've spoken in Vancouver, Paris, and other summits, as well as other people in the community; Christian has spoken about how to build Swift web apps and how to build Swift middleware. So if you need any getting up to speed, that's the best way to do it. There have been some new talks given this summit as well, so when we do this again in Barcelona, we'll have even more stuff. So without further ado, I'm going to let Clay take it over and talk about what's new for operators.

Yeah, so Doug and I were talking about what we wanted to go over in this cycle, at this summit, and I was looking back at what I'd been working on, both upstream in Swift and at SwiftStack, trying to operationalize Swift for folks, and I really wanted to talk about some operator features. Because we've done some stuff on the back end that has been some good improvement, and that's what I've been working on, so that's what I want to talk about.

What happened there? I don't know how to use these things. Clay, I told you, it's just up and down. It should be simple enough. Yeah, luckily they don't let me run slides in my day job; I just work on distributed storage systems. There we go.

So one of the changes we were working on, with another core developer, Matt Oliver, and a few of us who had been talking about this idea, was a way to improve the speed of some operations under failure. Swift is a distributed system: it takes advantage of the hardware you can give it, and it's going to be as fast a storage system as you need. But when things fail, there are always opportunities for timeouts, and things are not going to go well. As the storage system itself degrades, some of that is going to be observable to clients. And it's our job, working on the system, not only to work around those failures and make sure we maintain availability, but also to optimize in those failure cases, to make sure we remain as fast and performant as possible. So this is a new feature, concurrent GETs. It just came out in the last OpenStack release, and it is currently not on by default in the default configuration.
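If you want to turn it on, it's just a couple of options in the proxy config. A minimal sketch; the values here are illustrative, not recommendations:

    [app:proxy-server]
    use = egg:swift#proxy
    # Off by default in 2.7; opt in explicitly
    concurrent_gets = true
    # Seconds to wait on a primary before hedging with the next request;
    # if unset it falls back to conn_timeout
    concurrency_timeout = 0.5

That concurrency_timeout is the knob that shapes the "waterfall" you'll see in the next few slides.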
So as you're doing your review of the changelog and getting prepared to deploy your Swift clusters, make sure that you go and enable that. At SwiftStack we did some internal testing, and on all of our new clusters going forward, based on some numbers from work that Doug has done, we're going to have this on, because it is an awesome optimization. The goal is to improve first-byte latency under failure. So let's look a little bit at how that works. And again, I apologize, I'm going to get down into the weeds, because I've been thinking about this stuff for the whole past release.

Now, this is a nice, good failure condition; in distributed, highly available systems, we love to think about when things are really broken. Here we have a triple-replicated system: there are three primary nodes holding this object data, and in the first two attempts to reach that data, the node timed out. Think of this as some sort of error that wasn't a quick response, where the drive had been unmounted or one of the other operational monitoring procedures had marked it offline; this is a discovered failure that just sort of happened. Network congestion, or some disk I/O that's hammering that particular device. So from the proxy that's coordinating all of these object servers and talking to them, when it goes to make a request to one of the nodes, it can't wait forever for that guy. We don't know what's going on with that guy's disk, his write queue depth, or whatever. So if that node doesn't respond, we're going to time it out and move on to the next one. So here we've got a real good failure: after that, the second node also failed. Finally we fail over to the third, and in his situation it was a very quick, short response; he was able to respond, and then we can return to the client. So we're working around all these failures, but the client can observe this. This is latency on the order of seconds, potentially, and it can be observable.

With concurrent_gets set to true, take a very similar failure mode: we have a couple of nodes that are timing out, and we just can't wait on them any longer. But what we do is stagger when we start the requests, in this little waterfall of when we push out the requests. Even though we haven't given up on that first request, and it may yet still come back, we're going to pre-seed the request queue a little bit and send out another request after a short wait, hedging our bet that maybe that second node is going to come back. So in the same failure condition, we've actually lowered the time to first byte that the client observes by 2x. This is sort of a perfect storm; this particular situation is, in some respects, what the feature was built for.

But when we were implementing this feature and thinking about how we could leverage concurrent GETs, we started thinking about all the different kinds of failure modes we might see. So take a very similar setting: you turn concurrent_gets on, you have a few-hundred-millisecond waterfall request pattern, and you have a slightly different failure mode, where the first node times out and the second node is just slow.
Eventually we get a response back from it, but not before the point where, again hedging our bets, we wanted to send out that next request. So the interesting thing you see here is that we respond as soon as we get back the first successful response. The third request that actually got sent out is going to end up piling into the disk queue and doing that read anyway; from the object server's perspective, it needs to service the request, it just doesn't happen any faster than on the other node that got started a little ahead of it. So in some ways you can think of that as useless I/O. It's not perfect in efficiency, but short of having the ability to see into the future, it's sort of perfect in implementation. Well, as good as we can do.

So here's a different failure mode. This is a great one: total primary storage failure. This is something you might see when a cluster's disks are getting full (we're going to talk about that a little later), or when you've had a massive partition in the network, and none of the primary storage devices for an object are reachable. These are typically going to be in separate zones, so this is some sort of systemic failure going on in the cluster. Very exciting stuff. We're going to do our waterfall pattern and hedge those bets, getting out all of those requests, and then we're going to go to a handoff node. If this was a new write coming in, and the previous write had also observed a similar failure pattern, it wouldn't have been able to write to the primary nodes, and it would have written that data into a handoff location, which the consistency engine would repair later. So there's still a good shot that we can find the data we're looking for on one of the handoff nodes. So there's a number of requests, up to twice the replica count, that will dig around the cluster to find a viable disk we can write to, for writes, or, following that same stable ordering, find data that may have come in during that degraded state.

But notice we hold off on the last request: we never have more than replica-count requests in flight. So it's not a perfect waterfall; you don't see the next request get started immediately after your concurrency timeout on the third one. We actually want to wait until one of the outstanding ones falls back out of the queue, so that we never have too many outstanding requests, because this is a pretty exceptional situation where you have a total primary failure. When we look at some of the other cases, you'll see a little bit of why this makes sense: we already talked about how that extra request, when you end up serving from the one you started earlier, can be a little bit of wasted I/O.

So this is concurrent_gets on, with the timeout a little bit smaller. It's a tunable, so you can shape that waterfall pattern based on your average time to first byte, your 95th percentile; Doug's going to look into a little of that. Here we have some slow responses, and that fourth request doesn't get started right there along the waterfall. It would not happen in this case, because the second request that we started was able to be served, so there's no extra pending I/O that, most of the time, is not going to be useful.
And here's another similar situation, where the concurrency timeout is very low and we're firing out multiple concurrent requests. This can introduce some extra I/O into your cluster. This is not the default; it's just meant as an example. If you shoot all three out at basically the same time, the first one to respond is the one we come back with. But it doesn't just keep going out through all 2x replica count into all the handoffs: we'll never have more than replica-count requests pending at a time. And this is some of the work we did while we were working on this new feature and figuring out how to make it work best for folks.

Yes, I'd be glad to talk about this. So we ran a bunch of benchmarks in our lab. It doesn't have to be a node or a drive that's failed; this can actually accelerate your workloads. So let's talk about the workload that we ran. We ran it through ssbench, and we had a workload with small objects and large objects. It's a very common workload that we see in genomics or media and entertainment, right? You have all these little models and you render them into a big frame, or you have all these little ACTG groups and you render them into a big genome. And what's happening is that when I'm writing or reading those big files, my tiny requests, my small requests, can get queued up behind that on a drive. And this is a perfect case, in a healthy cluster, to speed up your time to first byte for your... Quality of service. Yeah, for your small requests. Our tiny objects are 4K to 32K in this example; our big objects are 50 megs to 128 megs. We did 90% tiny objects, 10% large objects, and within the tiny objects and large objects, we were doing 50/50 read/write.
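For reference, that mix could be expressed as an ssbench scenario file along these lines; the operation count and exact weights here are illustrative, not the actual file we ran:

    {
      "name": "90/10 tiny-large, 50/50 read-write",
      "sizes": [
        {"name": "tiny",  "size_min": 4096,     "size_max": 32768},
        {"name": "large", "size_min": 52428800, "size_max": 134217728}
      ],
      "initial_files": {"tiny": 90, "large": 10},
      "operation_count": 10000,
      "crud_profile": [5, 5, 0, 0],
      "user_count": 150
    }

The crud_profile holds create/read/update/delete weights, so [5, 5, 0, 0] gives the 50/50 read/write split, and user_count is the concurrency.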
And here's what this looks like. With concurrent gets off, the way you'd have it in Swift 2.6, we're seeing that on a tiny object, on average, we're getting about 174 milliseconds. So again, this is a small cluster: three nodes, 12 drives per node, something you'd see as a typical small Swift deploy. And we're hammering it; I think I had 150 concurrent requests in my ssbench settings on the previous slide, hitting it as fast as we can. So we're getting an average of 174 milliseconds, and our 95th percentile, if you're looking at the worst results, is 698 milliseconds. Once we turned this on, the first thing I did was set it to 200 milliseconds, so if I'm queued up for more than 200 milliseconds, it does that concurrent firing. And you can see I got a 74% improvement in my average, and a 134% improvement in my worst cases. And if you reduce this all the way down to 0 milliseconds, where it's firing all the requests simultaneously, which I think is a bad idea and we'll talk about that in a second, the benefit just gets better and better and better.

The reason I put an asterisk here, and would not recommend you set this below 50 milliseconds, is the I/O. I had 36 drives, and when I set it that low, I'm doing all these, as Clay called them, useless requests, and I actually started getting a couple of errors, where my request queues (these are SATA drives) were getting a little too long. I think during that 0 millisecond run, I got nine request timeouts, 503s, and that's not something you want to see in a healthy cluster. So if you're going to play with this in a cluster, start high, at 200 milliseconds, go down to 100 milliseconds, and work it that way. The other interesting thing here is that it even helped the first-byte latency for large objects. When you think about it, there's a statistical case where a large object gets queued up behind another large object, and it helps with that, too. Now, if you ask, "Hey, Doug, why does this chart have the tiny objects' last-byte latency but the big objects' first-byte latency?", it's because, honestly, there's about one millisecond of difference between first- and last-byte latency on a small object, so you might as well treat them the same.

Here's the same thing as a graph. If you look at the red and orange lines, those are my 95th percentiles for the large objects and the tiny objects; the two lines at the bottom are my average latencies for the small and large objects. You can see you get a big benefit just turning it on at 200 milliseconds, and then you really get diminishing returns down to zero milliseconds, which is why I'm really going to advocate that you don't blow up a production cluster by setting it to zero right off the bat. I really like this graph, because it validates what we saw when we were working on it and trying it out in various lab and dev environments. This is exactly what we wanted to see: we understood there was a range here and that there were some trade-offs, and that diminishing-returns curve is exactly what we were expecting. Doug did all the hard work. Well, you programmed it, so.

Okay, so that was concurrent GETs. Turn it on; it's faster; it's great. I want to talk about another operational thing that I've been working on. Failure is still an interesting thing, and in this particular case I want to talk about a specific failure, the 507. I've been involved, over this last cycle, in a number of engagements, it feels like, where the growth curves look like this. Everything is going great: we're growing, we're adding data to the cluster, they've got new use cases coming on from internal partners that want to move off expensive storage onto Swift's API-based commodity storage. And suddenly someone on some internal team finds, man, this is working really well, I want to backfill a bunch of old data, or I want to start applying this to new applications, and all of a sudden the growth curve takes off. Because it's just API storage, it feels like it's infinite, unless you're the guy operating the cluster, and then you actually have to plan for these capacities. So everyone can imagine exactly what happens in the next few ticks on this graph. But what you may not necessarily ever want first-hand experience with is what happens if that growth curve does outpace your ability to manage your cluster, bring in new nodes, plan for that capacity, and get them online. So I'm going to talk a little bit through the process of what happens when Swift runs out of storage.

First of all, the object server, in attempting to write an object down to a disk that doesn't have enough space for that data, returns a 507 HTTP status response immediately to the proxy, which the proxy will observe, and it will error-limit that particular device and won't try to send future writes to it.
So back when we were talking about that total primary storage failure: one way you can get into the case where new objects are being written entirely onto handoff nodes, instead of their primary nodes, is if all of the primary disks are full. But Swift will dig through up to twice the replica count of handoffs; it works its way around the ring, through every place the partition could be placed in that stable pattern, until it finds somewhere to store the data. It will dig around for every last byte it can possibly find so that it can write down that data and service the request. Swift is a system for turning disk space into HTTP 201 Created responses; it won't give up until it absolutely can't go on any further. But one of the interesting artifacts that comes out, not in the write path, is once you start observing these 507s, which can get returned all the way out to the client. The client can observe a 507, storage space not available, because the proxy, despite its best effort, cannot find anywhere else that it can fit these bytes.

So then you're left in this situation: what do I do about these 507s? One of the first things a lot of people think is, okay, I want to send a DELETE. I've got some old data, some test data, and I want to delete that. But as people have talked about with distributed systems, doing deletes is sometimes harder than doing writes, and the way Swift has to implement a delete is really kind of a transaction: we have to write down that you wanted to delete this object. Particularly when things are in a failure mode, with devices in different places, we still have to record that delete operation, and in Swift that's implemented as a tombstone, a zero-byte file marker that gets written down to indicate the object was deleted; sometime later, after the consistency engine runs, that marker can get reaped as well. But these tombstones can get written onto handoff nodes, not the primary locations. Generally, when you think about it down at the data structure, first we have to write down that it needs to be deleted, and then we can unlink the fat data file that actually has the hundreds of megabytes of object in it. So you have to write first, and then you can unlink. If the primary data holder is over here, but he can't be written to because he's out of space, the tombstone can actually end up getting recorded onto another device, which will have to be replicated over (again, a write) before the data can get deleted. So a DELETE request may just create zero-byte files and not actually remove anything from the cluster. And of course, while all this is going on, replication is actively trying to repair this, and any handoffs, any data that's been written out of place, is going to be getting moved back to the primary nodes, the primary devices where it's supposed to be stored.
So Swift is doing what it's supposed to be doing, but as an operator, when you're in this situation, you really are trying to figure out: okay, I need to come up with a better plan; I've got to do something to actually deal with this. And Swift provides a number of tools to operators so that they can make a cluster-full situation, which I hope you never get into, essentially a non-issue. I talked about those clients getting those 507s. When they call the operator and say, hey, I got a 507, that is either a very calm conversation, where he says, it's no problem, I'll have the new drives in later this afternoon and I'll enable emergency replication mode, or you can delete some old data; or it's a conversation where their hair is on fire and they're freaking out, because maybe they're not aware of all the things you've kind of got to put together to keep the system working in a healthy mode. So here's some stuff you've got to think about.

The first tool we have available is fallocate_reserve. I talked about how, during a write, if there's not enough space to store the object, the object server returns a 507 to the proxy. fallocate_reserve allows you to say: I want to hold on to some additional space. Give me a few hundred megs in addition to the size of the object, and if I don't have that much, go ahead and return the 507 and find a different disk to put this on, because this one is nearly full. Holding a little bit back means that when a delete request comes in, you will still have room to write those tombstones down next to the data files on the primary locations, where they're supposed to go, so that deletes via the API can actually reduce your used space.
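This one's a single line in the object server config. A sketch, with an illustrative reserve of 100 MB:

    # object-server.conf
    [DEFAULT]
    # Fail writes with 507 unless the object would still leave this many
    # bytes free, keeping headroom for tombstones and cleanup
    fallocate_reserve = 104857600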
The next thing, and this is a newer feature (fallocate_reserve has kind of been in there for a while): the replication engine we talked about, we've learned and observed, can fight against you. Even if the object server allows you to write something else over here, the consistency engine is going to be wanting to put it back over there, trying to repair that failure as it observed it.

So, rsync modules per disk. First of all, it's just a great tuning; it's a new way for people to balance their I/O. You can have per-device concurrency counts, so you can tune at the individual drive level rather than the server level, and marry that more closely to your queue depth. A good default a lot of people have been running is around 25 or so for a node, but that means you could potentially have 25 connections going to one specific disk. With rsync modules per disk, and the rsync_module template you can put in your object-server config, you can tune it such that if a single device already has four other nodes in the cluster trying to push a partition onto it, you just have to wait and come back around whenever a slot is available. So it's better load balancing for your disk I/O. And the advantage of having this is that you can also shut those replication connections down at the per-disk level, and we'll look at that on the next slide.

The next thing you need to think about is emergency replication mode. There are a couple of tunables; I kind of group them together because you're going to be using them together a lot. When you're in a situation where you really need the consistency engine to focus on moving data onto your new nodes, there are a couple of settings in the object-replicator section you can use to indicate that this is the primary work I want you to be doing. It's not the default, though, because it focuses work on one particular thing, and the consistency engine normally has to take a larger view: not only do I need to be moving data, I also need to be repairing durability and consistency.

So, disabling rsync modules: SwiftStack implemented this as part of our node agent, which is already monitoring devices and alerting you when devices are getting too full and that sort of thing, so it was easy for us to add it there. Everybody's got different ways of doing it; there's been a number of Swift operators, and we've all been sharing, okay, how are you guys doing it? It's hard to be prescriptive, because everybody has some of their own tooling built up, but it's also nice to have examples, so just check it out. It's just a very short Python script. You can see it goes over all of your devices and does a stat, and if a device doesn't have as much free space as you want it to have, you disable rsync to it; otherwise, you make sure rsync is enabled. To disable rsync, you just write down this little template file that takes that device's rsync module and sets its max connections to negative one, which means that nobody can talk to that disk via rsync. And of course, if the disk is healthy, or once it becomes healthy (because even though it won't be receiving rsync connections, it can still push data out and be the rsync client), you want to make sure that configuration file gets removed. You just unlink it, get rid of it, and if it's not there, that's fine, move on to the next guy. Not that you'd necessarily run this exact script, and I'm sure you can come up with something more elegant, but that's the general idea: mark the disk's rsync module with max connections negative one when it's full, and pull that back off whenever it's ready to go.
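Here's a minimal sketch of that idea. The paths, threshold, and rsyncd layout are illustrative, and it assumes your rsyncd.conf pulls in per-device override files with an "&include /etc/rsyncd.d" directive; this is not the actual agent code:

    #!/usr/bin/env python
    import os

    SRV_ROOT = '/srv/node'            # where object devices are mounted
    OVERRIDE_DIR = '/etc/rsyncd.d'    # assumed: rsyncd.conf includes this dir
    MIN_FREE_BYTES = 50 * 1024 ** 3   # refuse inbound rsync below 50 GB free

    for device in os.listdir(SRV_ROOT):
        st = os.statvfs(os.path.join(SRV_ROOT, device))
        free_bytes = st.f_bavail * st.f_frsize
        override = os.path.join(OVERRIDE_DIR, 'object_%s.conf' % device)
        if free_bytes < MIN_FREE_BYTES:
            # Full disk: a negative "max connections" makes rsyncd refuse
            # the module, so nothing gets pushed TO this disk. It can still
            # act as an rsync client and push its own partitions off.
            with open(override, 'w') as f:
                f.write('[object_%s]\nmax connections = -1\n' % device)
        elif os.path.exists(override):
            # Healthy again: drop the override so replication resumes.
            os.unlink(override)

And the emergency replication knobs Clay mentions are, I believe, handoffs_first = true and handoff_delete in the [object-replicator] section of object-server.conf, which tell the replicator to prioritize pushing out-of-place partitions first.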
Now you can see how it all plays together. The first step to think about is a new write coming in: we're going to reserve some space on this full device. So for the three main primaries, the user writes to the proxy, the proxy writes to the nodes, the one that's full (or getting full) returns the 507, and the proxy writes that data down somewhere else. The next thing that's going to be going on in the cluster is that it's going to try to repair that write: the primaries want to fix the fact that the other primary that should be holding the data isn't. And we are not going to let that happen; we're going to disable rsync to those disks, so that we don't end up backfilling and working against ourselves, because we know what we've got coming. The real solution for a full Swift cluster is to add new capacity, and when that new node comes online and you rebalance, partitions are going to get moved around. Emergency replication mode ensures that the primary work the consistency engine is focusing on is pushing the data off of our full disks onto our filling disks. And that is how you can make a full cluster kind of a non-issue.

Awesome. So I'll just speed through; we've got a couple of cool new things to talk about also. Those were definitely the best ones, the high-ticket items; that's why we like Clay to talk about them. So, one thing Swift has always been great at is multi-region: you have a single namespace across these links, but they're high-latency links, and that's something eventual consistency totally helps with. And yet we've gotten a bunch of support calls at SwiftStack where people haven't configured it right, and we want to make sure the whole community knows how to configure Swift correctly for multi-region. You set up a region in London, in Australia, and in the Americas, and the next thing you know, customers are calling because they're getting high latency on their PUTs, their GETs, on everything.

The first thing is really obvious, and everybody probably already knows it: you need to set up read affinity. Read affinity says that when I come to a proxy node in my region, I should look first for an object in my region and not go across the wire. Of course, if it doesn't find it there, because it hasn't replicated that far yet, it will still go across the wire. The second thing you probably already know is write affinity. Write affinity is a double-edged sword, because it says that when I write data, all three copies go to the same region, and then the replicator actually moves them to the other regions. Read affinity would still be able to access that data if you're accessing it from another region it hasn't replicated to yet. However, with write affinity you do have to worry, because right now, for this small moment in time, I have all three copies in one region, and if that region fell off the face of the earth, I don't have the same durability I normally have in Swift.
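In the proxy config, those two settings look roughly like this; "r1" here stands in for whatever region number the local proxy lives in, and the values are illustrative:

    [app:proxy-server]
    # Reads: prefer copies in our own region
    sorting_method = affinity
    read_affinity = r1=100
    # Writes: land all copies locally, let replication fan them out later
    write_affinity = r1
    write_affinity_node_count = 2 * replicas

Each region's proxies get their own region number in these settings, so the same cluster-wide ring behaves locally everywhere.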
But what you probably don't know are the next two settings, and some of these came out back in the Liberty release, around the Tokyo summit, and we didn't have time to talk about them. So you set up read affinity, you set up write affinity, and users are working and they're much happier, but every once in a while you get a user coming to you going, hey, my connections are still really latent on every PUT, GET, etc. Well, when you think about it, the way we authenticate in OpenStack is with tokens, and to cache and speed things up, tokens are stored in memcache, in a memcache pool. And if you have your memcache pool spanning all these regions, there's a chance that I'm sitting in London and my token is being cached in Australia, and that's not a very good thing. So what you want to do is take your proxy servers, alter your proxy-server.conf, and set up the proxy server to only use memcache servers within its own region; I'll show what that looks like in a second.

So now your customers are much happier, but then every once in a while you get a customer coming to you going, hey, you know, I've set up storage policies, my data writes only to London, my objects are just in London, I'm not using the full multi-region settings, and yet I'm still highly latent. What's going on? Here's what you have to remember: the object layer is able to use storage policies, and storage policies (if you don't know them, we'll talk about them later) allow you to control the durability of your data, the geolocation of your data, and the tiering of your data. But if you've picked the geolocation of your data to be only London, well, the account and container layer, which is our database layer, does not abide by those storage policies; it always replicates across the entire cluster, all the regions. So there's a setting out there for async container updates, where essentially the database in your region gets updated synchronously, and the databases in the other regions get asynchronously updated by the object server, so your customer is always getting that fast 201 on their PUTs. It really only affects you on PUTs, updates, and deletes. So there you go: that's how you make multi-region really work for you.
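The memcache fix is just scoping the server list in each region's proxy configuration (it can live in proxy-server.conf or in /etc/swift/memcache.conf). A sketch, with illustrative addresses for one region's memcached instances:

    [filter:cache]
    use = egg:swift#memcache
    # Only this region's memcached instances -- not the whole pool
    memcache_servers = 10.1.0.10:11211,10.1.0.11:11211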
Another quick thing we want to talk about. Yeah, so this is something we've been working on for a while: the POST operation, something Swift provides to perform metadata updates on objects you already have in the system. The original implementation notated the change right next to the object, down in a .meta file next to the data file, which allows you to override or change some of the metadata associated with that object. That can be useful if you're doing some sort of indexing and need to keep some metadata with the object. But it didn't update the container listings, and one of the pieces of metadata the listings include is the content type, so since there was no update to the database layer, the container listing layer, there were some pieces of metadata you couldn't change on an object. So later, a new implementation that wanted to make that a little more consistent transformed the POST operation into a server-side copy, which is exactly how S3 does metadata updates. But it's cost-intensive on the server: it has to actually read all the bytes, do whatever transformations it wants to the metadata, and then store it all back down. We were never really happy with this, but POST operations weren't common in most workloads, and in workloads where POST was important, there was still a tunable, object_post_as_copy, which you could change to false to fall back to the old behavior with all of its restrictions.

So one of the things we worked on in this cycle was unifying those two implementations, so you get the same behaviors and expectations from the client side (you can update whatever metadata you want on the object) but without the side distraction of it actually being transformed into a server-side copy and therefore being slower, particularly on a large object. You might upload hundreds of megabytes and then just want a quick little POST request to add some additional metadata. Now, you could have sent that metadata with the object itself, but if you're calculating different kinds of checksums besides the MD5 that Swift already keeps for you, you may not know that information until after you've read the data, so you want a quick way to notate some information on the object after you've uploaded it. So now, when you set object_post_as_copy to false in Swift 2.7, you get fast-POST, which has all of the behaviors and all of the good stuff you get from post-as-copy, but with none of the bad stuff where you have to transfer a lot of bytes. I would say use it as the default; use it every time. The Swift community is very careful about this: we want operators to have time to migrate to new settings, and we don't want to spring any surprises on anyone. But one of the sessions we're having now is, okay, what's our deprecation cycle, how can we get rid of post-as-copy? There is just no downside to object_post_as_copy = false.

All right, so let's move our talk along a little to talk to the developers in the room; we've definitely spoken to you guys before, and we want to tell you about new updates to the Swift API. The first one: last time, Clay and I talked about ranged SLOs, and that was a whole new feature in the Swift API. One of the new things now is that, by default, SLOs can allow segments as small as one byte. Previously, we had this slide up and said, hey, why would you want to use a ranged SLO? Well, I've got three different objects (these could even be SLOs themselves), and I want to make them into a new object, so I can use these ranges in my new object. But the caveat was that none of these ranges could be under a megabyte. Well, guess what: now they can. I do have one warning to throw out there: if you are going to use ranges under a megabyte, by default, in Swift 2.7, we are going to rate-limit that. The rate limiting works like this: for ranges under a meg, we will only deliver one segment per second. Now, of course, your admin can change that: you can change it to deliver five a second, or you can actually move that minimum size from a meg down to half a meg. But I did want everybody to know that. This is actually the third time in a row that we've talked about new improvements or new features in the API for SLOs; it's really turned out to be kind of an interesting feature, letting you create a very small little blob of data that didn't really exist on its own by composing it out of other things.

Another new thing out there: everybody knows that Swift can list files, and we can do markers and things like that. Now you can do reverse listings. The use case for this is generally log files: you're putting log files in, they're segmented by date, and a listing comes up in what you would think of as reverse order; same with versioned objects. It's really simple: you add a query parameter, ?reverse=true, and guess what, you will get your objects back in reverse order. And if you need to know more about the query parameters in Swift, John Dickinson, the PTL, gave a great talk back in Vancouver; if you look it up, you will learn about markers, prefixes, delimiters, and everything else you could want to do with query parameters.
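To make that concrete, a reverse listing is just one extra query parameter on a normal container GET. A tiny sketch with the requests library; the endpoint, token, and container name are hypothetical:

    import requests

    # Container listing in reverse lexicographic order instead of the
    # usual ascending order.
    resp = requests.get('https://swift.example.com/v1/AUTH_test/logs',
                        params={'reverse': 'true'},
                        headers={'X-Auth-Token': 'AUTH_tk_example'})
    print(resp.text)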
Now, I have a wag of my finger for a couple of the developers in the audience. We've talked for almost two years, since Swift announced storage policies; I spoke in Vancouver about how to implement them, how to query them and put them into your programming. They're an extremely powerful thing: they allow you, as an end user or an app developer, to determine the data durability, the geolocation, and the tier of drives. And I have seen very few applications on the market giving the end user, the customer, the ability to pick their storage policy. So what's happening is it's all just going to the default. Back in my example earlier, when I talked about multiple regions, you need to expose that to the operator, to the end user. If this is your backup program, Commvault or whatever, there needs to be that drop-down. I took a quick screenshot of an application where, when you create your container, it gives you an option of all the storage policies out there. And again, this must be done at container creation; you cannot change it later. So the audit is: go through your code, find the places where you're creating a container, think about whether that code could be pointed at a cluster where there are actually different storage policies and different reasons why someone might want to use them, and figure out a way to give that option to the user.

The other thing developers have hit me with a ton is, gee, object storage is really slow; I use it and it's terrible. And I'm going, well, I'm using it and it's not. So one of the things we want to do, again as a performance knowledge transfer, is this: object storage is high-latency, and what you need to do is not treat it like a filer or a block system, where I do a request, then wait for that to come back, then do another request, and another request. When I work with developers and their application runs slow, it's generally because they're doing one request at a time, and there's a huge amount of latency. The thing you need to do with object storage is fire off a ton of concurrent streams of connections. Generally, when I work with anything, I always start with 10 streams or 100 streams; object storage is perfectly great at handling that, and that's not something you'd normally do with a filer, because that would be a terrible idea. And I can show you right here that on both PUTs and GETs, if I have a single connection, and these are 128K objects, so kind of small objects, I'm getting pretty poor performance, but doing the same thing with 100 connections is giving me something like a 30-40x improvement in my speed and my transfer rate. And if you combine that, next slide, with larger object sizes, you get even more advantage. So this is the same graph, the same information I just showed you, but instead of 128K objects, I'm showing the performance with 2 meg objects. You can see on the right, with 100 connections transferring data (I think I was using getput as my benchmarking tool), 2 meg objects on the GET side, I was almost maxing out a 10 gig pipe. And this was again against that small system I did the other benchmarking with: 3 nodes, 36 drives, and I was limited to 10 gigs, which was all my proxy had going in and out of it.
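In code, "fire off a bunch of streams" is nothing fancier than a worker pool doing requests in parallel. A minimal sketch; the storage URL, token, and object names are hypothetical placeholders:

    from concurrent.futures import ThreadPoolExecutor
    import requests

    STORAGE_URL = 'https://swift.example.com/v1/AUTH_test'
    HEADERS = {'X-Auth-Token': 'AUTH_tk_example'}

    def fetch(name):
        # Each GET is independent, so they parallelize cleanly
        resp = requests.get('%s/mycontainer/%s' % (STORAGE_URL, name),
                            headers=HEADERS)
        resp.raise_for_status()
        return name, len(resp.content)

    names = ['obj-%06d' % i for i in range(1000)]
    # 100 concurrent streams: great for object storage, terrible for a filer
    with ThreadPoolExecutor(max_workers=100) as pool:
        for name, size in pool.map(fetch, names):
            print(name, size)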
The last thing I want to bring up is some best practice around sharding. So, what's sharding? Well, I've seen a bunch of developers treat Swift as if every container is infinitely deep, and I'll tell you why that's wrong: even with Amazon S3, the bucket is not infinitely deep; they've got a knowledge-base article that says here's how you need to name your objects if you want this bucket to scale. It doesn't matter if it's Swift or Ceph or anybody doing object storage: there's no such thing as an infinitely deep container. So it's up to you, as the developer, to create multiple containers. In fact, when you think about it, Swift can have, in each account, millions of containers, and we need you to code your application so that it doesn't try to dump everything into just one container.

There are a couple of different ways to do this. A number of applications I work with, say document management systems, are already renaming objects: I give you a document called cat.xls, it's all about my favorite cats, and you put it into the document management system, but you rename the file, put it into a container, and keep a database entry saying cat.xls equals this new name. If you're doing that, it's fine to do what's called fill-and-spill. You decide you're going to max out at, say, a million objects in a container (and remember, you can always HEAD a container to find out how many objects are in it), and once you hit that amount, you create a new one. A lot of document management systems have this already, because they're used to jukeboxing with DVDs: they'd say, once seven gigs of storage, or whatever it is, is in this directory, I'm going to spin off a new directory. Do the same thing for object storage.

Now, if you want something where operators or end users can actually look around in the system, you need to do categorical sharding: make it intuitive. If you have a log-based system, maybe you create a new container each day, with the date, and you put all the log files for that day in it. If you have a video surveillance system, do the same thing, but also append the camera name. Again, we can have millions of containers, and it's much easier to clean up a container like that. You wouldn't take a directory and put a hundred thousand entries into it, because Windows would not have a good day; just because we can put 50 million objects in a container doesn't mean you should. File sync and share, same thing: build accounts, maybe one for each user, and instead of directories let them have many containers; or, if you have to have one account, have containers for each end user. Think about the data model and lay it out based on something that makes sense; that's normally the best way, but you can also use consistent hashing. It's just really important for everybody that we make this scale.
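A minimal sketch of fill-and-spill with python-swiftclient; the auth details, naming scheme, and one-million threshold are illustrative choices, not a prescription:

    from swiftclient.client import Connection
    from swiftclient.exceptions import ClientException

    MAX_OBJECTS = 1000000  # spill to a new container past this point

    conn = Connection(authurl='https://swift.example.com/auth/v1.0',
                      user='test:tester', key='testing')

    def pick_container(base='docs', index=0):
        name = '%s-%04d' % (base, index)
        try:
            headers = conn.head_container(name)
        except ClientException:
            conn.put_container(name)  # doesn't exist yet: create and use it
            return name
        if int(headers.get('x-container-object-count', 0)) < MAX_OBJECTS:
            return name
        return pick_container(base, index + 1)  # full: spill to the next one

    container = pick_container()
    conn.put_object(container, 'cat.xls', contents=b'...')

In a real application you'd cache the current container name rather than HEAD it on every write, but the idea is the same.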
Here's a quick something we ran in the lab. This wasn't on that same system; I spun up a couple of instances. So: millions of entries in a single container, down there, and I think this goes up to about 50 million entries. What I was doing was just testing the account/container layer. I started off getting a rate of around 300 PUTs per second on the database side, and as I kept putting more and more objects in, you can see the database for that single container started to slow down. And remember that in Swift, every container has its own unique database, which means if you had ten containers, it would be ten-fold the performance. Now, I was using an SSD drive; there are certainly environments out there running the account/container databases on a spinning drive, and a spinning drive is going to have much worse results here. We've certainly worked with developers who put 100 million objects into a single container on a spinning disk, but what you're going to see is this graph keep trending down towards zero, and at some point in time you're only going to be able to do 3-4 PUTs per second. So do yourself a service and shard. The biggest containers are normally the ones filling up the fastest, so if you're trying to maintain a high request rate, it's going to drop off after a point.

Well, Clay, we are right on time. If there are any questions, hit us right now, and if not, I think they're going to kick us off stage; feel free to catch us afterwards. We have a question.

"Amazing talk, guys. The concurrent GETs is really exciting. Just wondering what can be done in terms of improving PUT performance? I know that we only return once all the replicas have been written, right? So is there anything that can be done to optimize that path?"

Yes, and actually some stuff is in; we didn't get to talk about all the great stuff we built in 2.7 in this talk. So one of the things: we have a post-quorum timeout. You don't actually have to have all three replicas return success; you just have to have a quorum of replicas, plus a small timeout you can tune. Essentially, once two of three have responded success, the proxy is ultimately going to respond success; it doesn't absolutely have to wait on that third one to be completed, because it already knows it has enough to indicate success. You wait a little bit longer because it helps things appear more consistent, more congealed, but that's a tunable, so you can bring that in if the final response on a PUT request is latency you're looking at.

Another thing that changed in 2.7: one of the jobs the object server has to do while writes are coming in is maintain some per-directory data structures that help optimize replication. It has to invalidate the portion of the namespace that's going to need to be re-evaluated by the consistency engine, and that used to be inline in the request, right along with the fsync and, potentially, the synchronous container updates. You can tune the container updates down, and that data structure is now a much quicker, lockless, append-only file: we just tell it, hey, the next time you evaluate this data structure, you also need to apply these changes that have accumulated. So yeah, performance is also something we're working on.

I would say the biggest thing, and this was in the Liberty release but we didn't get to talk about it, is the async container updates. Right now, when I write an object, I need quorum: if I'm making three copies, two of them have to land on disk, and within a reasonable time, two of them have to be updated in the database layer. Saying that the account/container databases can be updated asynchronously allows the response to come back as soon as the data lands on disk. Now, depending on your database, like we were just talking about, if you've got a container with 100 million objects in it, that could be your bottleneck, and decoupling it even allows the database to work more efficiently, in a batch mode. So take those region settings I gave you and apply at least that one, even if you don't have multi-region.

The post-quorum thing is not exactly off by default; it's only controlled by its timeout setting, so by default it may be the case that, generally speaking, the third response happens before the timeout would have kicked it out. I believe it's something on the order of seconds.
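For reference, I believe the two knobs from that answer look roughly like this; the option names are from the sample configs, the values are illustrative:

    # proxy-server.conf
    [app:proxy-server]
    # After quorum succeeds, wait at most this long for the remaining
    # replicas before responding to the client
    post_quorum_timeout = 0.5

    # object-server.conf
    [app:object-server]
    # Give up on synchronous container listing updates after this long;
    # the update is queued as an async_pending and applied later
    container_update_timeout = 1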
So if you really need to return as quickly as possible, the async container update is maybe the first tunable you look at, and the post-quorum timeout in the proxy server is going to be the next one; make sure you're bringing that stuff in as tight as you need it. We've got the settings on a number of these slides, if you look at them afterwards, and again, they're all available for download right now.

"Last thing. I've always wondered about Swift and been too lazy to check myself: between the proxy and your storage nodes, does it do any kind of HTTP keep-alive-style connection pooling, like keeping connections open?"

No. Most of that we leverage from the kernel's TCP stack; there are a number of prescribed TCP tunings for keeping those sockets open. It kind of depends on the cluster layout, how many different nodes and ports you want to be connected to, but we don't do it in the application; we just tune it down at the socket and kernel level. There's a paper out from Intel on those tunings, especially when you're doing erasure coding. The network tunings have sort of carried us this far, and it ends up just being a prioritization game between all the other places where we can really make a difference in our application code versus optimizing what Linux already does. But if you hadn't seen those before... does that answer your question? Yeah, it does.

Great. Well, thank you guys very much. Thank you.