Okay, good afternoon. My name is Donna McCabe, I'm from HP's Helion team, and I'm going to be talking about Swift; my colleague Lorcan Brown is with me as well. So we're going to talk about Swift in the public cloud. For the agenda, we're going to cover some background on Swift in our public cloud, then describe our monitoring environment, then describe some of the operational procedures we have, and then hopefully we'll have time for questions and answers. I should say as a preface that I'm going to assume you know what Swift is and how it operates, so there may be some terminology I use that you won't understand. If that's the case, just come up later and we can discuss it in more detail.

The people on stage here work in the Swift service team, so we're a combination of developers and DevOps people. There are other parts to the public cloud and other operators involved. There's tech ops, the people who look after the core of the servers. There are data center operators, the people who go around the data centers pulling out drives and replacing them, taking out servers, moving cables around the place. And there's a network operations center, which is involved in controlling the whole cloud: they monitor the cloud and figure out if things are going wrong. Those teams are not dedicated to Swift; they operate the cloud generally, across all the services we run, whereas the Swift team is specifically for Swift. And of course, backing up the Swift team is the OpenStack community itself and the core Swift developers. That's where we get Swift, and we'll talk about the very minor modifications we make.

This is our view of what happens in the cloud and how we manage it: there's deployment, monitoring and operation, and those are the aspects we're going to go through in this talk. The first part is around monitoring; oh, sorry, the first part is actually about Swift itself. In the public cloud we operate two data centers, and we've been doing this for over three years now at this stage. We have 18 petabytes of raw storage and three and a half billion objects. In terms of servers, we have a little over a hundred proxy servers and 700 storage nodes, so about 8,000 drives under management. Feature-wise we use pretty much standard Swift: like most people, we're using three replicas, and we map those to three availability zones, and I'll talk in a minute about how that affects our operations. We have a single storage policy. Currently pretty much all our drives are relatively homogeneous, in the sense that they're all roughly the same type of drive, so there's no ability to offer different storage levels; that's why we have a single storage policy. It's predominantly upstream Swift, except we also have a content delivery network bolted on the front, along with a few odds and ends. For example, if you look at how long we've been operational: initially we didn't have Keystone in the environment, we used swauth, and then we replaced it with Keystone, but lots of accounts had already been created.
So we continue to support those accounts. To give you an idea, this is the type of server we'll be deploying in the future, or at least in the medium term. One of the details to note is that we're looking at six-terabyte drives, and obviously bigger drives will soon be, or already are, available. Predominantly, though, what we have deployed at the moment are two- and three-terabyte drives. If you look at this example (a front view and a back view, the front showing the disk drives and the back the fans and other supporting hardware), a single rack like this gives you 1,620 "terabytes" of raw storage. I put terabytes in quotation marks because that's what the disk manufacturers give you, and of course they use decimal when sizing things. If you start taking that into account, together with a replica count of three, you end up with a usability factor of about 80% on top. One thing you don't want to do in a Swift deployment is literally fill the system, because the system then becomes relatively unmanageable; if anyone's familiar with it, there's a parameter called fallocate_reserve that is your friend here, it will stop you filling up, and you should definitely use it. Take those factors and three racks and you're looking at about one petabyte usable in that sort of configuration. So, three racks; though we have more racks than that, because we're using smaller drives.

I mentioned we use three zones, and this maps very nicely: we built out three failure zones, and when we talk later about how we deploy the system, you'll see how that helps us as well. Across these three zones everything is replicated. We have load balancers front-ending the system, and those load balancers can send requests to any of the availability zones. Power, cooling and so on are isolated per zone, and each zone has its own networking, but they can all talk to each other. So we can operate without any downtime if we lose one of these zones. That has actually yet to happen: we've never lost a complete zone in hardware terms.

Okay, so what sort of usage can you expect? As operators you will see different characteristics; this is what we see in our public cloud, and it's dictated by our users, who range quite a lot. There are all sorts of different types of users in our cloud, most of whom we're not aware of. I personally only become aware of any given user if there's a particular problem with that user, so unfortunately I only ever hear about the problem cases, never the successes.
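Before moving on to the workload, here is a minimal sketch of the rack-sizing arithmetic above. The per-rack drive breakdown in the comment is a hypothetical example, and treating the decimal-to-binary conversion and an 80% fill target as separate factors is an assumption about how the quoted figures combine:

# Rough usable-capacity estimate along the lines described above.
RAW_TB_PER_RACK = 1620               # vendor (decimal) TB per rack, e.g. 270 x 6 TB drives
DECIMAL_TO_BINARY = 10**12 / 2**40   # ~0.91: vendor TB -> TiB
REPLICAS = 3
FILL_TARGET = 0.80                   # never run the cluster full; enforce with fallocate_reserve

def usable_tib(racks):
    raw_tib = racks * RAW_TB_PER_RACK * DECIMAL_TO_BINARY
    return raw_tib / REPLICAS * FILL_TARGET

print(usable_tib(3))                 # three racks -> roughly a petabyte usable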
Still, there are some interesting takeaways. Most objects are small: the bulk of the objects are in the range of one to 64 KB, which is this whole band here. A very small percentage are what you might call medium-sized, bigger than 100 KB, and they barely even show up on this slide. And about 0.01 percent of objects are what most people would consider very large. However, you get a very different picture if you look at it from a space-usage point of view: half of your objects are those small objects I talked about, but that tiny fraction of very large objects occupies nearly half your space. And then there's another small fraction that's a useful piece of data to know if you're sizing things: about 1% of your total storage is probably taken up by the account and container databases. So that's what we see.

As for the sort of operations to expect, this is operations per day, normalized to one petabyte of user data. So if you have a petabyte of user data under management (you may have more raw storage than that), this is roughly what to expect. You can expect a little more GETs than PUTs, then a large number of DELETEs (we're talking millions), and then other operations: HEADs, container listings, that sort of thing. The operations on the left, the PUTs and the GETs, are the ones that carry data, the payload. Looking at it from a size point of view, what you can see is that PUTs actually dominate, at least in our environment, and GETs don't. One thing I've observed, which isn't easy to represent here, is that some percentage of those PUTs actually get deleted fairly quickly; I'm not sure why, but that is a characteristic, and remember from the previous slide that we're looking at a million or so deletes a day. The other thing is that the amount of data is not that great: we're talking about 34 million operations a day, but only around 2.5 terabytes a day, which backs up my contention that most of those objects are small.

Okay, so that's what the cloud looks like from our point of view. What is it like from an operational point of view? I'm going to talk a little bit about how we monitor the cloud, and I should preface this by saying that Swift is very highly available and very highly resilient. In fact, one of its potential downsides is that it's too resilient: you can build up a lot of problems in a data center, and unless you keep on top of them you're not going to be aware they're there until something finally goes wrong. You can lose many, many servers, even full availability zones, and apparently nothing happens; and then things go crazy. So as a result we've developed various tools, and we come at this from two approaches: looking from the outside, and looking from the inside. For the external view we've developed a fairly simple uptime-monitoring tool.
One of the things about Swift is that, because of the way the ring is built, if you do several hundred (certainly several thousand) requests, even in a very large data center, and presumably they're all to differently named objects, you're going to visit all of your servers in a relatively short amount of time, and a large percentage of your disk drives. So if there are any problems in that environment, an individual user will actually see them: most of their operations will work fine, ninety-nine percent of them, and then one operation will be extremely slow, and that can affect their overall performance. That's one of the reasons why we watch this. In a given cycle we do about a hundred requests, so over several cycles we'll have visited pretty much every drive in the system. We measure and log soft failures, which is any failure at all, and what we call hard failures. Swift is a cloud service ("application" is probably the wrong word), so you can expect to get the odd failure: we need to do things like upgrade software, so we have to be allowed one failure every now and again. A hard failure is different: that's a failure that persists after a retry. We also track the latency, and we look at the average as well as the max; we want to make sure the average is fine, but also to see whether we occasionally hit one of those high outliers. We have the ability to chart those as well.

The problem with the monitor I just showed you is that it sits inside our data center, so it's not seeing things from the outside world's point of view. So one of the things we use, for example, is Pingdom, which is a good way of looking at this type of data. I'm showing here two years of data. I'm not hugely pleased about the number of downtimes we've had, so I went through them: there were 16 outages, and looking at those in detail, most of them are actually DNS lookup problems of some sort. They're not failures in our service, so I'm happy enough with that, though from our point of view it's still not great. But over two years we're looking at 99.998 percent, so overall I think we're happy enough with that. The other thing we run continuously are smoke tests, checking on the operation of the system. This happens all the time; it's not a regression test, it only takes a few minutes, but it becomes more useful during deploys, and Lorcan will talk about that, so I'll skip on. So that's looking at things from the outside.
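Here is a minimal sketch of that kind of external prober, not the actual tool: the endpoint, token handling, request count and the idea of pre-created probe objects are all illustrative assumptions.

# Minimal sketch of an availability probe: spread GETs across many object names so
# that, over several cycles, most drives in the cluster get exercised.
# Assumes the probe objects already exist in a dedicated monitoring account.
import random
import time

import requests

ENDPOINT = "https://swift.example.com/v1/AUTH_monitor/probe"  # hypothetical
TOKEN = "..."                                                  # obtained out of band
REQUESTS_PER_CYCLE = 100

def probe(name):
    start = time.time()
    resp = requests.get("%s/%s" % (ENDPOINT, name),
                        headers={"X-Auth-Token": TOKEN}, timeout=30)
    return resp.status_code, time.time() - start

soft = hard = 0
latencies = []
for _ in range(REQUESTS_PER_CYCLE):
    name = "probe-%06d" % random.randrange(100000)
    status, elapsed = probe(name)
    latencies.append(elapsed)
    if status >= 500:
        soft += 1                      # any failure counts as a soft failure
        status, elapsed = probe(name)  # one retry is allowed
        if status >= 500:
            hard += 1                  # still failing after retry: hard failure
print("soft=%d hard=%d avg=%.3fs max=%.3fs"
      % (soft, hard, sum(latencies) / len(latencies), max(latencies)))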
The other thing we do is maintain an extensive set of tests monitoring the system itself. There are lots of obvious things you can do; lots of people are familiar with the swift-recon tool, for instance, and async pendings is a number that people often don't look at, but it can be a very good indication that there's something wrong in your system if you see it growing. As I said, I'm not going to explain what that means in Swift here, but come talk to me later. Beyond the obvious checks, over the years we've developed less obvious ones, for example NIC speed. It's all very well to say the NIC is working, but you need to make sure it's running at the right speed, because when it isn't it will, rather unfortunately, appear to operate normally; it's just slow, and that's going to affect your end users. We also look at things like I/O wait times (I have a little slide on that next), and there are lots of other little things we check. Generally our philosophy is that if we find a problem of any sort in the system, we develop a monitor for it and deploy it; it's very important to try to nail down every possible source of problems and make sure you know what they're about.

I mentioned this a bit earlier: people are familiar with predictive failure, the SMART logic you get on drives, so the drives can tell you they're going to fail. We have that deployed, but it doesn't always actually work. In addition, there's a tool called collectl, a public-domain tool from one of my colleagues, Mark Seger, which we use to look at I/O patterns. What we're showing here (I don't even know what the units are, but they're very short periods) is that a very large number of operations completed in a very small amount of time, and then we bucket the longer ones: eight operations took longer, one operation took even longer. As you can see, this drive here is clearly different from the other drives, and that can be a good indicator that things are going wrong with that drive. For a drive in normal operation that profile would be fine. It can be a bit difficult to interpret during periods of very heavy replication; after a drive has been replaced, for example, you would see long wait times as well, because the amount being moved is very large. Most of our monitoring data gets viewed in a single console that people are already familiar with, and we also have a dashboard for metrics, which you've already seen a bit of. It can be useful for trend spotting; we tend to show the last hour and the last seven days, which gives us an idea of what's going on. So that's our monitoring.
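As an example of one of those "less obvious" checks, here is a minimal sketch of a NIC-speed test that reads the Linux sysfs interface; the interface name and expected speed are assumptions for illustration.

# Minimal sketch: confirm a node's NIC negotiated the expected link speed.
# A NIC that silently falls back to a lower speed still "works", just slowly.

EXPECTED_MBPS = 10000        # assume a 10 GbE cluster network
INTERFACE = "eth0"           # hypothetical interface name

def nic_speed_ok(iface=INTERFACE, expected=EXPECTED_MBPS):
    try:
        with open("/sys/class/net/%s/speed" % iface) as f:
            return int(f.read().strip()) >= expected
    except (OSError, ValueError):
        return False         # unreadable speed counts as a failed check

if not nic_speed_ok():
    print("ALERT: %s is below %d Mb/s" % (INTERFACE, EXPECTED_MBPS))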
So I'm now going to hand over to Lorcan Brown, who's going to talk about the operational aspects of the data center.

Thanks, Donna. I'm first going to talk about the Swift runbook. What is the Swift runbook? It's basically a compilation of routine operational procedures and instructions for anybody who wants to operate the Swift system. Why do we have this runbook? Well, essentially, as a service team, to put it quite bluntly, we don't really want to be doing all the day-to-day operations on the system. We have other teams who can help us with that, but those teams don't share the same expertise as us, so to bridge the gap we have a runbook. When the other teams (the tech ops team, the data center operations team, the NOC team) see a problem or an issue, they go to the runbook and they can fix it for us, so it removes a lot of our day-to-day work on the cloud system. We populate it and the other teams consume it. There are two important things about this runbook. First, it's continuously updated: it's very rare that a week goes by when we don't make a change, whether it's an issue that's cropped up where we need to document the procedure for fixing it, or a new feature and a new check where we need to explain those checks. The other thing is that whenever we add procedures, we try our best to automate them whenever we can. For example, originally when we had to replace a drive after a failure, we needed to partition it, add a file system, label it and mount it so it was ready for Swift's consumption. This was put in the runbook and it was a long process; eventually, over time, we automated it, deployed the scripts, and replaced those runbook instructions with "just run this script". Other things we have in the runbook include the CMDB, the configuration management database, which keeps a constant record of all the nodes we have in our system, and explanations of monitoring results. As Donna pointed out, we have a lot of monitoring tools and a lot of diagnostics, and where the service team may understand what the output means and know what to do with it, the other teams, who aren't exclusive to Swift and have other services to worry about, don't have the same expertise. They need to know: when I see this alert, what do I do, what procedures do I follow, who do I need to inform? Then there are other things like basic system checks (is Swift up or down) and log interpretation: what to look for in the Swift logs and what certain log patterns mean.

An example of how we use the runbook and how we collaborate with the other teams is unrecoverable read errors (UREs), and this isn't something that's exclusive to Swift at all: once data has been written to a device, a sector of the device can no longer be read, and it's something that's becoming more and more prevalent with the growing size of drives. In the context of Swift we see them in two ways. First of all we see object UREs, which Swift will nicely fix for us using the replicator, the updaters, the auditor and so on.
More important to an operator, though, are filesystem UREs. For these there's a tool in Swift called swift-drive-audit, which scans the kernel log, finds the UREs and informs the operator; the operator then needs to deal with them manually. Early on in our public cloud days we enabled this tool, and we came across a few issues. When we enabled it we also had the runbook procedures entered, and there was quite a lot to do when you saw a URE. So it was enabled, and we fed the output of swift-drive-audit into our monitoring, so an alert came up whenever we saw some UREs; but a couple of issues cropped up pretty quickly. First of all, we were getting feedback that when UREs popped up it was taking a considerable amount of time to fix them; the procedure was just too long, it was taking up too much time, especially of the NOC's, and it meant they were neglecting other services and other issues in Swift. The other thing we noticed was that when we did fix a URE, the warning wasn't disappearing immediately; the alert was still staying there. This turned out to be down to how swift-drive-audit works: when it looks through the logs, it looks back over a certain period of time, in our case 24 hours, so if you catch a URE straight away and fix it, it still remains in the log, and therefore in the alert, for the next 24 hours.

Our solution was to automate the whole process and improve the efficiency of how we read the output of swift-drive-audit. When it ran it would give us a list of UREs, and then a script would check those UREs and try to read them, to see whether they were in fact UREs or bogus. If they were found to be fine and could be read, they were ignored; otherwise we put them into a temporary broken-URE file, and if that broken-URE file was present our monitoring would pick it up and raise an alert. If there was an alert, we then had improved automation for the fix as well: we wrote a script, deployed it into our system and updated the runbook, so instead of taking ten steps it was just one; the script read the broken-URE file and fixed everything up for us.
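Here is a minimal sketch of that verification step, assuming the swift-drive-audit output has already been reduced to a list of suspect file paths; the file locations and the "broken URE" convention are illustrative, not our exact tooling.

# Minimal sketch: re-check suspected unrecoverable read errors (UREs).
# Paths that can still be read in full are treated as bogus reports; the rest go
# into a "broken URE" file that the monitoring system alerts on.
import os

SUSPECTS_FILE = "/var/cache/swift/suspect_ures.txt"  # hypothetical input list
BROKEN_FILE = "/var/cache/swift/broken_ures.txt"     # hypothetical alert trigger

def readable(path, chunk=1024 * 1024):
    try:
        with open(path, "rb") as f:
            while f.read(chunk):     # reading the whole file re-triggers a real URE
                pass
        return True
    except OSError:
        return False

suspects = []
if os.path.exists(SUSPECTS_FILE):
    with open(SUSPECTS_FILE) as f:
        suspects = [line.strip() for line in f if line.strip()]

broken = [path for path in suspects if not readable(path)]
if broken:
    with open(BROKEN_FILE, "w") as f:
        f.write("\n".join(broken) + "\n")
elif os.path.exists(BROKEN_FILE):
    os.remove(BROKEN_FILE)           # all clear: clear the alert immediately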
While we're on the subject of UREs, they're something we see an awful lot of day to day. On average in our system we probably come across over one URE per day, which translates to approximately five drives with a URE per hundred drives per year. That's quite substantial, and if they're left untreated they can often lead to disk failures.

Now I'm going to talk about how we deploy Swift in the public cloud. This diagram is basically the pipeline of our development life cycle for Swift. At the start, when a developer needs to make a change (a code change, a configuration change or a feature addition), he or she has to put it onto one of our many development systems and make sure it's okay. These development systems all have smoke tests associated with them, and they also have the same monitoring running, so once the change is in there we can make sure the smoke tests keep working and our diagnostics and monitoring are still fine. Once that's okay and the developer has tested that the change actually works, he or she pushes it up for peer review and it ultimately gets added to the code base. Then, after a period of time (in our case usually approximately a month), we bundle all these changes together and make a formal deploy to go to production. Before they go out to production, though, we deploy them to a smaller QA system which mimics production in every way, just at a smaller scale. In this system we leave the code to soak for approximately a week, where again there are smoke tests, monitoring and tools, but also extra functional tests, regression tests and integration tests. A week is quite conservative by a lot of people's estimation, but we find that when you put a lot of changes together it can take a couple of days to a week before problems rear their ugly heads, especially when you're dealing with other services such as Keystone or Glance that interact with Swift, and you want to see how they operate against the latest code version, i.e. what we'll have in production. Once we get the go-ahead from QA, we're ready to push out to production.

This slide is about how we do our production deployments; we have a particular way of doing them, and there are two main aspects. First of all, we make sure the system is completely clean before we start: we do a pre-check where we make sure our smoke tests all run to completion and that our monitoring is green across the board, with no errors, no problems, no warnings and no drives down, and we also have a quick look at our dashboards to make sure there are no odd trends; we want to make sure everything just looks normal. The second part of the deploy, which is important to note, is that, as Donna pointed out earlier, our production systems are split into three availability zones, which also correlate to failure zones, and with three replicas in our Swift system that means each zone should have one copy of any given Swift object. With this in mind, we deploy to a single AZ at a time, and we do this because of the properties of Swift.
For a GET operation in Swift, all you need is one replica to be present; for a PUT operation, all you need is a quorum of your replicas to land (two out of three in our case) to get a successful PUT. So when we do one AZ at a time, even if we brought that AZ down through some mistake in the deploy, if the code was poor, we'd still have two-thirds of our system, so the system would still be up and there would be no downtime. There might be a small performance decrease, granted, but it would still be working. Once we've deployed to an AZ, we go back to the same pre-check tests and make sure everything's running and everything's clean, and once that's done we move on to the next AZ, and just rinse and repeat. If at any stage during this we do see an error in the code, maybe something gets brought down somewhere or the monitoring fills up with warnings, we stop immediately and revert to the original code; that way our system is still up while we work out what went wrong.

The other thing to note here is that deploys sometimes require rolling restarts, and there are two points we'd like to make about that. When we do our restarts, we like to limit it to approximately 10 or 15 servers at a time; if we do any more than that there can be a performance drop, and that could affect our customers. As well as that, we use a reload instead of a restart. The difference is that if we restart Swift while there's a large transaction going on in the middle of it, say a large PUT of a few gigabytes, we want that to run to completion, so with a reload we wait for the transaction to finish and then restart the services. That way there's no issue and the customer always sees their transaction complete.
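Here is a minimal sketch of that per-AZ, batched reload pattern. The host lists, the precheck callback and running swift-init over ssh are assumptions for illustration; the batch size and the reload-not-restart choice follow the description above.

# Minimal sketch: deploy one availability zone at a time, reloading services in
# small batches so plenty of healthy proxies stay behind the load balancers.
import subprocess

BATCH_SIZE = 10   # 10-15 servers at a time; more risks a visible performance drop

def reload_host(host):
    # swift-init's "reload" lets in-flight requests (e.g. a multi-gigabyte PUT)
    # finish before workers are replaced, unlike a hard restart.
    subprocess.check_call(["ssh", host, "swift-init", "proxy-server", "reload"])

def deploy(zones, precheck):
    """zones: {"az1": [hosts...], ...}; precheck: callable returning True if clean."""
    for az, hosts in sorted(zones.items()):          # one AZ at a time
        if not precheck():
            raise RuntimeError("cluster not clean before touching %s" % az)
        for i in range(0, len(hosts), BATCH_SIZE):
            for host in hosts[i:i + BATCH_SIZE]:
                reload_host(host)
        if not precheck():
            raise RuntimeError("problems after %s: stop and roll back" % az)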
One major attribute of Swift is obviously the ring, and it's a big deal operationally, how to manage the ring in particular. For those unfamiliar with it, the ring is just a basic data structure that decides where data goes in the system. Each service must have at least one ring file: the account and the container each have one ring file, and with storage policies you can have multiple object rings. The ring files are managed with the swift-ring-builder tool, which is quite good, but unfortunately at scale it doesn't work quite as well, because you have to make an individual call to swift-ring-builder for each individual change you make to a ring: every device you add, edit or remove. As Donna said, we have approximately 8,000 devices, and we're in the business of adding or removing a few nodes, or possibly even racks, at a time, which can mean 500 or 600 devices; it's quite cumbersome to do that one at a time, and there's also a margin for error, since you could quite easily make a mistake.

With this in mind, the solution we came up with involved putting a wrapper around swift-ring-builder, together with CSV files. If you look here to the left you can see a sample of one of our CSV files; it represents what we'd like our rings to look like. The first three non-commented lines are the three rings, the account ring, the container ring and the object ring in this case, and the three parameters beside each are the common ring parameters: 15 is the partition power (far too low for a production system, really, but this is just an example), three is the replica count, and 24 is the minimum time in hours between rebalances. Underneath we give the list of devices we want in the rings, and we do this on a node-by-node basis rather than per device, purely for legibility: if we have a couple of thousand devices in a system we don't want a huge long list; we want the operator to be able to look at a small file and see the system in a snapshot. The three parameters per node are the IP address of the node, the zone that node is associated with (1 here), and the hardware type, which gets read later on, as I'll explain in a second. So, together with the CSV file, we have a wrapper around swift-ring-builder, a ring script of our own, that reads the CSV file, looks at the IP addresses and the type, and says: this type is an "SC1170s_3", and from its list of supported hardware it knows this node has, say, 12 devices and needs to be part of all three rings. It expands the CSV file with that information and compares it to the existing ring files on the system, and from that it generates a series of diffs. If we were, say, adding two nodes of 10 devices each, you'd have a diff of 20 devices that need to be added; these get fed automatically into the ring builder, the ring builder produces our new rings, and then they're ready to deploy. Everything I've shown you here above the broken line is done offline, pre-deploy, and everything afterwards is done during a deploy.
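Here is a highly simplified sketch of that CSV-driven wrapper idea. The device naming, ports, weight and hardware-type table are assumptions, and unlike the real wrapper it rebuilds from scratch rather than diffing against existing builder files; it simply emits swift-ring-builder calls.

# Highly simplified sketch of a CSV-driven wrapper around swift-ring-builder.
# Expected CSV lines: ring definitions ("object,15,3,24") and node lines
# ("10.1.2.3,1,SC1170_3" = ip, zone, hardware type). All values hypothetical.
import csv
import subprocess

DEVICES_PER_TYPE = {"SC1170_3": ["sdb%d" % i for i in range(1, 13)]}  # hypothetical
PORTS = {"account": 6002, "container": 6001, "object": 6000}          # traditional defaults
WEIGHT = "100"

def run(*cmd):
    subprocess.check_call(list(cmd))

def build_rings(csv_path):
    rings, nodes = {}, []
    with open(csv_path) as f:
        for row in csv.reader(f):
            if not row or row[0].startswith("#"):
                continue
            if row[0] in PORTS:                  # ring definition line
                rings[row[0]] = row[1:4]         # part_power, replicas, min_part_hours
            else:                                # node line
                nodes.append((row[0], int(row[1]), row[2]))
    for ring, (part_power, replicas, min_part_hours) in rings.items():
        builder = ring + ".builder"
        run("swift-ring-builder", builder, "create",
            part_power, replicas, min_part_hours)
        for ip, zone, hw_type in nodes:
            for dev in DEVICES_PER_TYPE[hw_type]:
                # "r1z<zone>-<ip>:<port>/<device>" is one accepted device format
                run("swift-ring-builder", builder, "add",
                    "r1z%d-%s:%d/%s" % (zone, ip, PORTS[ring], dev), WEIGHT)
        run("swift-ring-builder", builder, "rebalance")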
It just doesn't work and So we need to come up with something a little we needed to come up with something a little more careful when we were deploying them We didn't want to deploy the wrong ring into the wrong system and that could be catastrophic So what we did was first of all when we had our ring files We got the MD5 some of all these all these files and we added them as system system parameters and on top of that at the same time We packaged our rings into a simple Debbie and software package, which we could distribute easily between all the nodes and So when the deploy started in the same fashion as I've described earlier We push these and we push these rings out and they get deployed on a temporary in a temporary area varkash Swift and They're put here and not an Etsy Swift So so we can check them beforehand to make sure so once they're deployed We have another check that runs and says oh I see that we have some new rings in varkash Swift It checks its MD5 son and it sees are they the same if they're the same then it knows it needs Knows it needs to be pushed into Etsy Swift and then the service is gonna pick up on those ring files And you have your new rings distributed and if this isn't the case and for some reason the MD5s don't match The deploy fails straight away and we stop and the reason for this is we can't afford to put in bad rings into the system And it's gonna be extremely dangerous Lot of your services can act up very funny and at worst case scenario you could lose some data and so another thing about when we deploy the ring is at scale and You need to deploy a sequence of rings at a time and On a smaller scale you might only need to do one but because of the properties of the Rings when you do a rebalance you can only move one Replic of a partition at a time so with this in mind you can't Necessarily make big changes at once so there's a technique which is quite common in Swift where if you're adding a certain amount of nodes You gradually add the amount of partitions to the nose in different ring cycles And so when we deploy these rings we need to know when we can deploy the next one and So how we sequence them and what time we can spread between them. 
Another thing about deploying rings at scale is that you often need to deploy a sequence of rings rather than a single change. On a smaller scale you might only need one, but because of the properties of the ring, a rebalance can only move one replica of a partition at a time, so you can't necessarily make big changes at once. There's a technique that's quite common in Swift where, if you're adding a certain number of nodes, you gradually add partitions to those nodes over a number of ring cycles. So when we deploy these rings, we need to know when we can deploy the next one: how to sequence them and how much time to leave between them.

For this we have two tools, both standard Swift tools, which we find very useful. The first is the replication time, which can be found with the swift-recon tool. The object replicator simply goes around a node checking whether all the object partitions are where they should be, and moves them if they're not. Normally, on a clean system, this replicator pass runs in maybe five or ten minutes, fifteen max. However, once you deploy a new ring your partitions have been moved, so the replicator kicks into overdrive and takes far longer to complete; you're talking about the time multiplying ten- or twenty-fold. So we need to wait for the replication time to go back to normal; that indicates all our partitions should have been moved in that time, and it's a good indication that more rings can be deployed. The other tool we use is the swift-dispersion-report, which is quite a handy tool for getting a good snapshot of your system. What it does is push out a number of tiny objects and containers to a set percentage of the partitions in your system; when you run the report, it checks those objects and containers to see whether all the replicas are where they should be. When your system is in good health and all the objects and containers are where they should be, you get a hundred percent report back. Again, once the rings change and the partitions are in different places from where they should be, you don't get a hundred percent coverage, and you want to wait until that gets back to a hundred percent; then you can change your rings again.

A couple of things to note here. We don't change rings every time we see a device, drive or server failure: it just isn't an efficient use of time at scale. It might work on a smaller scale where you've only got a small number of devices, but because, as Donna said, Swift is so robust, and because with our monitoring tools we can see straight away when a device goes down, we just don't do it. We have about 10 or 11 failures a month, and if we ended up doing a ring change for all of those, we'd spend our whole time at it and it wouldn't be any use. Also, Swift proxies obviously aren't part of the ring, so they're much easier to add or remove: you just push the ring files to a new proxy if you want to add one, and take them away if you don't.
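A minimal sketch of that "when can the next ring go out?" gate. How the two numbers are collected is left as callbacks, since parsing swift-recon and swift-dispersion-report output (or pulling the same data from a metrics pipeline) varies by setup; the threshold is an assumption based on the times quoted above.

# Minimal sketch: only push the next ring once replication time is back to normal
# and the dispersion report shows 100% of copies where they should be.
import time

NORMAL_REPLICATION_MINUTES = 15   # a clean pass finishes in roughly 5-15 minutes

def wait_for_next_ring(replication_minutes, dispersion_percent, poll_seconds=600):
    """Both arguments are callables returning current values, e.g. thin wrappers
    around swift-recon --replication and swift-dispersion-report."""
    while True:
        if (replication_minutes() <= NORMAL_REPLICATION_MINUTES
                and dispersion_percent() >= 100.0):
            return            # partitions have settled: safe to deploy the next ring
        time.sleep(poll_seconds)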
This is just a graph of our replication time over the course of a month during which we did a few ring deploys. As you can see, for most of the time it's extremely low, nearly at zero, but it spikes when we do a ring change, going up as far as roughly 1,500 minutes in this case, which is approximately 24 hours. That means we had to wait around 24 hours before we could do it again, as long as the dispersion report was also giving us a hundred percent coverage. These changes are quite spaced out here, but from what I was saying earlier, as soon as that hits zero again we can go again, so these four changes could have been done over a much shorter period. I'll leave it to Donna to give a summary.

Okay, sorry. So, when we were designing this talk, we were trying to understand what we would want you to take away. One of the purposes of the talk was to explain, if you're an operator, the sorts of things you need to worry about. Even though that's important for you, time didn't allow us to go through everything; when I went through the monitoring, I went through it very, very quickly, and in my original slide set that was a very big section, slide after slide, so you'll appreciate that although monitoring is very important, we couldn't really go into too much depth here. But from our point of view, as I say, we monitor everything. Swift will continue to operate, but by being on top of your monitoring you know when things start to break, and you can start repair actions before things become a problem. So that's important.

You also want to make your operations and procedures repeatable, whatever you're doing in your data center. At one stage we had a deployment that actually failed on us, and we had a little bit of downtime. It was due to a bug in my code: I'd written some code, and it ended up uninstalling all the software on one of the AZs. Afterwards we tried to figure out what went wrong, because as Lorcan was saying, we've got a pipeline where we test. What we realized was that different operators were doing different things: one operator's way of working had hidden the problem, and the other operator, who deployed on the production system, followed a different process. So we realized we needed to make the process repeatable, and that's a major reason why we have the runbook. By doing that, once you've got a procedure that works, you can be sure it's always going to work.

As I said, keep on top of the problems. One of the things people obsess over (certainly some of our colleagues in the network operations center watch it) is async pendings. That's a number that can go very, very high, and there are all sorts of operational reasons why that can happen. However, if you understand why it's going high, don't panic: just get around to fixing the problem and that number will eventually come down. We've seen numbers up in the millions, even tens of millions, of async pendings, and that's fine on a very large system. So don't panic when you see it, but figure out why it's happening and then go fix it.

So what will you be doing day to day as an operator? General break-fix: these are large systems, so things break, and that happens at a pretty steady rate. A little takeaway for you is that, at least in our experience, UREs are a reasonable amount of overhead; that's something you'll need to fix, and it depends on how many spinning drives you have. With as many as we have, we're looking at one every day or two, something of that order.
Sorry one every one or two days. That's out of order And obviously you need to review your system state. So if you're monitoring your system, please Please watch the monitor and then finally we didn't really touch on this But the other thing you will get is users are going to have problems operating with your system So you yourself as an operator need to know how Swift works in request terms and especially some of the interaction So one of our biggest areas that we're pre-com to get problems is things like credentials So they say they got a problem at Swift turns out that we're problem with Keystone credentials That they don't have the appropriate roles to use Swift for example So you need to be on top of that understand that because otherwise you won't be able to handle user queries For some reason static web seems to cause a lot of problems for users as well So that's another area at least that bothers us The tip typically cases that somebody's got a public container They're viewing it in the browser they enable static web and then they get a 404 not found The reason is when they enable static web They said they have an index file indexed at HTML file It goes look for it, but they actually didn't put a HTML file or a H an index file there So it's not found and that's the reason So those sort of little gotchas can happen all the time Okay, so we can take questions So if you're gonna ask a question if you can go to the mic there It's the quick question about drive failure. So when a drive fails, do you replace it immediately or do you wait for a bunch? Or what's what are your kind of thoughts on drives? We What by immediate I wouldn't say immediate we as soon as we notice it it gets scheduled for replacement And then it's up to the the data center operations people to replace it. It tends to happen on a Shift by shift basis. So it doesn't happen immediately So it can be up to a day before the drive is replaced sometimes depending on if you're unlucky But yes, we don't wait for a bundle. So yeah, we're not lights out. We're actively managing it Kudos Donna for going over all this with some great operational experience. Thank you for sharing I thought the ring packages was a kind of a cool trick when you guys are going through the The replication cycle time doing like a gradual capacity adjustment How are you just looking at the recon files to get the replication cycle time? And then you had graphs so you're feeding that back through like stats the emissions or so and that's our dashboard Yeah, so but we're using yet. So Recon swift recon Measures the replication time. So yes, so we're plotting that data we we basically there's a Metrics pipeline in our data center that we feed that data to okay the other thing is we We also use the the recon for replication is occasionally the replicators get stuck and So you you see in recon as the last replication period So if you see that never moving then you know you're in trouble and you may need to restart the replicators Yeah, so the data is there so if recon gives it and we just pull it out relatively easy quick question Yeah, so do you load balance user traffic away from zones that are under maintenance? No protect the read. No, so you just mix it. 
We just leave it mixed in. Actually, Lorcan just touched on that point: if you notice, when we're restarting, there are occasions when we need to restart daemons, and we limit the number of daemons we restart. Our core approach is to do a reload rather than a restart, but even with a reload we limit the number we do at once, because there is a very small window where the load balancer has just talked to one of the proxy servers, so it thinks it's up, while we're in the process of reloading it, and it may send its next request there and possibly get a failure. So we keep the numbers relatively low. Typically we expect users to do at least one retry; that's a core premise, and that's how the system operates.

Hi, not specifically a Swift question, but what is your policy on firmware consistency, or inconsistency, between your AZs?

Yes, so we check that as part of our processes. We have a kind of baseline of what we expect the firmware to be. Sometimes this can be a little hard to figure out, and sometimes you need to use your systems for a while to realize whether or not you've got the correct recipe, but we have seen strange behaviour at the Swift level due to firmware versions not being correct, so that is something you do want to check.

But within an AZ, do you run consistent firmware across all your devices and servers?

Yes, all the firmware should be a level playing field across our whole system, not just within an AZ. And of course, one of the problems is that in a break-fix environment somebody pulls out some component and puts in a new one, and the replacement component may be an old part, so it may have out-of-date firmware. That's one reason for tracking your firmware, so you can schedule getting that fixed.

Actually, I have two questions. One is about disk recovery: right now you use six-terabyte drives, so do you have a number for how long, on average, it takes to recover the data?

It depends on how many UREs you have on the drive; you could have multiple UREs on a drive, and that just takes longer, so I couldn't really give you a single figure.

Say a typical case, because right now the trend is for drives to get larger and larger, right? There'll be bigger drives by the end of this year and everybody will be using 10-terabyte drives next year.

Well, that will be a problem for the Swift design generally. You're talking about a complete replacement of a data drive? Yes, so when you completely replace a drive, that's just replication. How long does that take? It depends on multiple things: the capacity of your disk and how full it is. At 50% full, a replacement might take about 24 hours; again, it depends on the number of partitions and whatnot. And to add to that: ours are relatively small drives, two- and three-terabyte drives, and we have seen up to 24 hours for a replacement of a single drive to get back in sync.

Okay, my other question is: since you run a public cloud, do you do anything to protect
other customers from one or two particular customers who use the system very heavily?

Sorry, I'm not sure I follow. You mean how we do the resource isolation part? Okay, so we have rate limits on the number of operations you can do on a container, and that's pretty standard practice in Swift, but that's really aimed at very, very large containers, which can slow down.

So you do it per user, per container?

Yes. But the core answer to your question is no, we don't have any particular mechanisms for mitigating the noisy-neighbour problem, as you put it. One of the aspects of Swift is that, because of the way it distributes everything everywhere (as I said, if somebody does even two or three thousand operations, they will visit most drives in our system), then yes, as a public cloud vendor you will see operations happening on most of the drives. However, remember that the system scales out, so as a result an individual will find it very hard to target any one bit of our infrastructure: unless you repeat the same operation on the same object all the time, you won't be going to the same place most of the time.

Okay, thank you.

I think we've been told to stop now, so if anyone has any other questions, just come up afterwards. Okay, thank you.