Good morning. Thank you for coming today. Paul and I are going to be talking about erasure codes, and we'll share some of what we've learned testing erasure codes through the work that's been going on over the last couple of years. By way of introduction: my name is John Dickinson. I'm the Project Technical Lead for OpenStack Swift, and I work at a company called SwiftStack. Is this thing on? Oh yeah, hey, rock and roll. All right, I'm Paul Luse. I'm not the Project Technical Lead; I'm one of the core developers in the Swift community, and I work at Intel.

So the first thing we want to talk about is: great, we've got erasure codes. What in the world does that mean, what are these things, and what's actually different now? Let's talk a little bit about the differences between the two approaches. Historically, when Swift was first written, it was written to support durability using triple replication — or replication generally — which simply means that you take one piece of data and then you store it multiple times in the system. Swift does that really intelligently, storing each of those individual copies in separate physical failure domains, whether that's isolated by drive, by server, by rack, or even by data center. So you can have hardware go down and still have availability of your data. That's what you have to do if you need durable data: you have to store your data more than once, and replicas are a really good way to do that. They're really simple, and they work for a lot of use cases. But the cost, of course, is that if you're storing one gigabyte, you've got to store three gigabytes inside the cluster. People have always come up to me and said, well, how do we make that better? What can we do? And the answer is always: well, how much is your data worth? Do you really want to lose all your data? If not, then
you need to store it multiple times. But there could be a better way — some other way to still get high durability while storing less data. And the way that's done is with something called erasure codes. Basically, erasure codes are a way of processing the data as it comes in so that it's broken up into different fragments that are each much, much smaller, such that the total data that needs to be stored in order to rebuild or reconstruct the original is much less than, say, the three-times overhead you would have with replicas. So in this case you may have a lot of fragments — many more than three; you may have somewhere around 10, or 20 or 30 different individual pieces — and those individual pieces will then also be spread throughout the system so that they land on different failure domains. The end result is that you get the same durability protection but pay much, much less in storage overhead: you may only pay something like 150% of the raw bytes instead of the 300% you pay with triple replication.

So what we have worked on over the last two years inside the Swift community is the ability to support erasure codes. The way we did that is we first implemented something called storage policies, which allows you to take a set of data, or a set of your hardware, and isolate it and manage it a little differently than the rest of the cluster — so you can say this kind of data is going to be stored one way, and that kind of data is going to be stored another way. Once we had that functionality built in, we implemented erasure codes as a storage policy for your cluster. So you can now have a triple-replicated set of data and also an erasure-coded set of data, on the same hardware or on different hardware. What that gives deployers is a lot of flexibility in how they want to store their data, and it gives end users the ability to control what cost they want to pay for storing their data. As a result of that, Intel and SwiftStack — Paul and I and various other people at our companies — have been spending a lot of time saying: great, now we've got erasure codes, so what does it look like? How does it perform, how does it compare, and what recommendations could we give people who are interested in doing this? And I've got to say, it's really nice to work with Intel, because they have lots of great hardware.
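To make the storage-policy idea concrete, here's roughly what the policy definitions might look like in swift.conf. This is a minimal sketch: the policy names are made up, and the exact ec_type depends on which erasure code backends you have installed.

    [storage-policy:0]
    name = triple-replica
    policy_type = replication
    default = yes

    [storage-policy:1]
    name = ec10-4
    policy_type = erasure_coding
    ec_type = liberasurecode_rs_vand
    ec_num_data_fragments = 10
    ec_num_parity_fragments = 4
    ec_object_segment_size = 1048576

A client then opts in per container — for example, by creating a container with an X-Storage-Policy: ec10-4 header. Objects in that container get erasure coded, while everything else stays triple replicated.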
I should add — I think it's clear from the title of the talk — that we're talking about performance here. If you're interested in exactly how erasure codes work, either how they were implemented or the theory behind them, there's lots of material available; go watch the Paris summit talks. Kevin Greenan, who's one of the EC guys, and I did a talk covering exactly how erasure codes work, both theory and implementation. So we're sticking to performance here, and I want to throw in one other disclaimer: this is performance as done by developers. We're not performance guys. So if there's anybody out there who makes a living doing performance benchmarking and analysis, you might be disappointed — we're not as systematic and detailed and thorough as some of you are, because we're just poor developers.

That said, we really had three goals with our performance work. Number one, we wanted to stress the system — not caring so much about optimizing every little knob and tweaking every little thing to eke out every last operation per second, but really looking at integrity: data integrity and integrity of the system. We've been doing this for months and months, and our first several runs on this hardware found lots of issues. So that was first and foremost: we wanted to stress it on a larger cluster than we typically test small features with. Number two, we wanted to understand the characteristics of how erasure coding behaves in the system under different loads, different workloads, different use cases, different object sizes, and different read/write — PUT/GET — patterns. Number three, of course, we wanted to collect some real numbers and compare them against the same exact system running triple replication, to give folks an idea — and to give us an idea, like John said — of how to make recommendations on when to use EC, what if anything you should be tuning for EC, and what you need to look out for.

So that's what was behind this. The picture here is of the main cluster we used: a 16-node cluster in Phoenix, Arizona — primarily E3 processors in the storage nodes, E5s in the dual proxies, a load balancer sitting on top, and four separate Xeon machines as workload generators. We used a combination of COSBench and ssbench — multiple tools, because again, this wasn't so much one systematic performance study as it was throwing the kitchen sink at this thing and trying to break it. Oh yeah, and also seeing what performance looks like. This is all 10 GbE on the back end; we had 20 Gb ingress from the load balancer, 10 Gb on each proxy, and then 40 Gb out to the cluster. So in most of our numbers we weren't network-bottlenecked, because we were looking at other things, but we had plenty of headroom. We also had a smaller cluster in our lab at SwiftStack, which is a little more network-constrained — it merely had 10-gigabit networking — and you'll see some results from that.

We wanted to cover two things. One is the holistic, end-to-end system performance difference: what does it mean for Swift as seen by an end user? And secondarily, we focused quite a bit on actual hardware resource utilization: what does this mean for CPU and memory and that sort of thing?

So, first off, the performance differences. This is the first thing people really care about: I've got replicas — should I go to erasure codes or not? Say you have a heavily loaded system with a lot of traffic to the cluster at the same time, and you're comparing the various erasure code policies against one another and against replicas. This is a picture of what that basically looks like for different object sizes, replicas versus erasure codes, and you can see there is an actual divergence. The two lines at the top are replicas —
that's 2x replication and 3x replication. It's probably very hard to read, but the y-axis here is megabytes per second, so we're actually ending up network-bound — we're saturating a 10 Gb pipe in both situations. But really what I want to focus on is the trends. It's not so important what one particular data point is, but how this behaves across the whole range and throughout the entire cluster. Along the x-axis we're going from smaller object sizes to larger object sizes, and especially in the early stages, with the smaller object sizes — and as we know, there are a lot of very small objects being stored in Swift clusters today — replicas are far and away better at reads. You're going to get better read performance out of replicas, and that makes sense: the data is taken off the drive, piped over the network to the proxy, piped over the network to the client, and that's it — a really straightforward data path. With erasure codes, the proxy is pulling fragments from multiple servers, recombining them with the erasure code library, and then sending the result out. So obviously that extra overhead of reconstructing the fragments into the original data will necessarily add some cost. The important point, though, is that as we get to larger object sizes we're able to saturate the network. So it's not actually slow; it's just that at larger object sizes you're going to be network-bound, just like you would be with replicas.

Now let's compare that to GETs — sorry, to PUTs. So we have an active cluster and we're pushing data into it. Think about this: with replicas, if we're storing, say, one gigabyte, and we push that into the cluster with triple replication, then as the client streams it in, the cluster is streaming it back out in triplicate — three times. Which means your internal cluster network requirement is basically your replica factor times the ingress: if you're sending in one gigabit per second, you need three gigabits on the back-end side. So if you have a 10-gigabit network, you're going to saturate at roughly a third of what it can carry — here, a bit over 300 megabytes per second of client throughput — because you have to send everything out three times for triple replication. And that's exactly what you see here: the red line initially starts fast and then plateaus right above 300. Big surprise. The green line plateaus right around 500 — that's 2x replication, kind of a reduced-redundancy scheme. Now, erasure codes don't send out nearly as much data over the internal network as replicas do — that's one of the benefits you get with erasure codes — so you can see you're still able to get full network utilization, at least on this 10 Gb network. And the various erasure code schemes we tested here — we had lots of different data and parity counts — all follow the same trend: the larger the objects, the better the improvement in actual client-side throughput with erasure codes versus
replicas. So this is actually exactly what you'd expect. Based on these tests and what we're seeing, we think erasure codes really start outperforming replicas for writes in a busy cluster when those writes are around 8 megabytes in size. There are a few different tests where you can see it: some show the crossover a little before that, some a little after. Paul's tests on the larger cluster showed it a little after that, but like I said, we were a little more network-bound in this smaller lab cluster. The point is: as objects get larger, you get better throughput with erasure codes, which is a really interesting result.

So now let's look not at network throughput but at the actual number of operations per second. We've got a 50/50 read/write workload here, and then a read-only and a write-only workload. The red bars are replication. With 4k objects, use replicas, obviously — especially on reads, that middle spike: you're getting up to 10 or 12 thousand operations per second pretty easily with the 4k reads, and that just radically dominates — just blows away — the erasure code performance when you're dealing with those very, very small objects. And we know quite a few Swift clusters are storing 4k, 10k, 100k objects. So if that's your workload, that's something you need to keep in mind: replicas actually work really, really well for that kind of content.

Now what happens if, instead of 4k, you use 4-megabyte objects? In this case it's interesting, because they start getting a little closer. Replicas are still a little better, on reads, on writes, and on the combined workload — but you see a pattern starting to emerge. And then when we get up to something much larger, like a 64-megabyte object: guess what, they're basically the same. In this particular set of tests the reads and writes came out roughly equal — you can see erasure codes were a little faster on the writes, the reads were basically the same, and replicas were a little higher on the 50/50 workload. The point, though, is the trend: small objects are really good for replication, and larger objects are a much better fit for erasure codes. So, that being said: what's going on inside the cluster? Two years of work for those three graphs — what do you think?
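As an aside, the write-amplification arithmetic from the PUT chart is easy to sanity-check. This is a minimal sketch — our own back-of-envelope numbers, assuming roughly 1000 MB/s of usable back-end bandwidth on a 10 GbE link, not anything from Swift's code:

    # Client-visible write throughput is roughly the back-end network
    # budget divided by the scheme's write amplification.
    def write_amplification(scheme):
        """Bytes sent to storage nodes per byte ingested at the proxy."""
        if scheme["type"] == "replication":
            return scheme["replicas"]          # e.g. 3x for triple replication
        k, m = scheme["data"], scheme["parity"]
        return (k + m) / k                     # e.g. 1.4x for a 10+4 policy

    backend_budget = 1000.0  # assumed usable back-end bandwidth, MB/s

    schemes = {
        "3x replication": {"type": "replication", "replicas": 3},
        "2x replication": {"type": "replication", "replicas": 2},
        "EC 10+4":        {"type": "ec", "data": 10, "parity": 4},
    }
    for name, scheme in schemes.items():
        amp = write_amplification(scheme)
        print(f"{name}: {amp:.2f}x -> ~{backend_budget / amp:.0f} MB/s client writes")

That prints roughly 333 MB/s for 3x, 500 MB/s for 2x, and 714 MB/s for 10+4 — which matches the plateaus on the chart: the red line just above 300, the green line around 500, and EC higher still.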
Okay, so the next set of slides is focused more on what's going on internally in the cluster, and I'm going to flip to the first one. In fact, John, if you could drive these — I'm going to go over here and point, because it's going to be too difficult to point from up there. Okay. So again, this was developers doing these tests, so I'm going to have to explain how these charts are to be read and interpreted. And again, the disclaimer: we weren't going for "make this run as fast as we can" — obviously that was a background goal. What we're trying to do with these charts is look for trends within EC and within replication and see whether they match, irrespective of each other. Does EC perform better in one particular case than replication does in that same case, or do they basically both follow the same trends? From a developer's perspective, we're looking for anomalies we didn't expect. So this wasn't as much about the performance numbers themselves as it was about understanding the code we've written and how it behaves.

That said, this top chart represents erasure code, and the way you read it is: this is total cluster proxy throughput — the blue is coming into the proxy and the yellow is going out. This is several tests in a row, so it's sequential in time. The numbers don't exactly line up, but the order is correct: 1 megabyte, 4k, 4 meg, 32 meg, 64 meg, 512k. You can see, roughly right here, this is the one megabyte test for replication — close to 4 gigabytes per second out. So there were a lot of PUTs going on there and some reads going on there. This was the one megabyte test, then there was some cleanup between the tests, then the 4k tests and some cleanup, then the 4 meg tests, and so on and so forth. The bottom chart was a separate series in time, but they're laid out here so they roughly match up, and you can see where the peaks and the valleys are roughly the same. So it's not so much a comparison between this chart and that chart, like I was saying, but a comparison of: what does this profile look like, what does that profile look like, and are they roughly the same? We went through and looked at all this stuff, and really the bottom line is that in most cases they are pretty much the same, which is good. We did have one bizarre thing we have to go look at, but you don't see it on this slide.

What you can pick out on this slide is a few things I've highlighted right here. At smaller sizes — as John showed from outside the cluster looking in — EC is much weaker: there are some GETs at one megabyte here, and some GETs at one megabyte in replication, and obviously a lot more throughput in the cluster with replication than with EC. Then at the larger object sizes you can see where EC peaks a little higher — that would have been the 64 meg, right around in that area. And we had a few little anomalies collecting the data; one of the tests hung for a little while, so there's some weird stuff there.

Also, one thing we wanted to note — I mentioned we were trying to throw the kitchen sink at this thing — we did more than just vary object sizes and PUT/GET ratios. We also varied the EC parameters, which includes the data-to-parity ratio you use. Most of this is 10+4; SwiftStack did a bunch of tests with other ratios.
We did some others as well. We didn't include it all in here, because we'd be here for three hours just looking at graphs and everybody would fall asleep. By way of clarifying what that means: when you take the data in and you're erasure coding at 10+4 — alternatively phrased as "10 plus 4" — it means you've got 10 data fragments and 4 parity fragments. You can see that if you've got four extra fragments, that's going to be a 40 percent overhead. So instead of a 300 percent overhead, you've got a 140 percent overhead for this particular durability scheme, and you can lose any four of those individual pieces and still be able to recover the data. Yeah, thanks for that, John. Another thing we varied was segment size — again, you kind of have to go watch the EC talk to get a feel for what all these parameters are — but that's something we messed with, looking to see if it had a significant impact, and in general it doesn't; it has a little impact, but nothing huge. We also toyed around with the various chunk sizes — the non-EC-related parameters: the incoming network chunk size and the disk chunk size, both at the storage node and at the proxy — trying to figure out exactly how all this data flows. At a high level, in the EC world we're bringing in data and buffering up a segment's worth in the proxy before we actually start pumping it out to the back end — so we've got to fill a buffer before we start putting it out there, and there are multiple buffers to deal with — so we toyed with those sizes a lot as well. Okay, next slide, John.

Okay, so this one talks about some of those EC variations. This is actually five different tests — all the same series of tests like we saw before, starting with one meg, then 4k, then 4 meg, and that sequence. The first one shows EC with a 640k segment size and 64k chunk sizes across all of the network and disk parameters you can configure in Swift — so basically all the defaults. This one is replication. And then we have three different ones following after that, showing variations of segment size, disk chunk size, and proxy chunk size. You can see there are definitely some differences, but nothing gigantic — no "oh my gosh, if you use EC, the defaults for Swift don't work." We absolutely did not come up with any conclusions like that. All of the defaults we have right now — go to master and pull it — are pretty good.
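If you want a feel for what those EC parameters mean outside of a full cluster, here's a minimal sketch using PyECLib, the erasure code library Swift builds on. It assumes PyECLib and liberasurecode are installed, and the exact ec_type string depends on which backends you've built — Swift's real data path (fragment archives, metadata, multiple segments) is of course more involved:

    from pyeclib.ec_iface import ECDriver

    # A 10+4 scheme: 10 data fragments plus 4 parity fragments.
    driver = ECDriver(k=10, m=4, ec_type='liberasurecode_rs_vand')

    data = b'x' * (1024 * 1024)        # one 1 MB segment
    fragments = driver.encode(data)    # 14 fragments, each ~1/10th of the segment
    assert len(fragments) == 14

    # Any 10 of the 14 fragments are enough to reconstruct the segment,
    # i.e. you can lose any 4 and still get your data back.
    survivors = fragments[4:]          # pretend the first four were lost
    assert driver.decode(survivors) == data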
Okay, so let's look at the number of requests. This is really just another picture of the data that reflects what John showed, which makes sense, of course — we're not going to see a different number of requests by status code inside the cluster versus what the benchmarking code sees looking into the cluster. So again, the same series of sizes through here. This one should maybe be shifted over just a little bit, so this peak lines up with this peak and this one with this one. But again, what we're really looking for is the trend: going this way in EC, where we see the highs, is it basically the same as what we see in replication? The one standout here — which makes sense, because John just explained it — is small files: this gigantic peak that sort of blows the scale out of the water on replication. If we didn't have this thing going up over 10k requests per second, you would see that the pattern of these peaks is basically the same as the pattern of the peaks in EC. So they trend the same when we beat on the cluster with different sizes; it's just the absolute values that are different. That's again a really good warm fuzzy for us on the development side.

Okay, so now let's take a look at CPU utilization. This was a big one there was a lot of speculation on early in the project: how much CPU are we going to eat up doing erasure coding? It's complex math, and there is no math in replication, so what's going to happen? Here's the picture — this is a slightly different tool collecting the data. What we're seeing here is all of the EC runs in this stretch of time, the same series of sizes, and then all the replication runs a couple of hours later. This is CPU utilization at one of the proxies; we looked at both and they're basically the same — they were pretty evenly balanced. You can see that across the board the average went up from about 10% to 16% utilization. So not dramatic, but worth noting: there are quite a few more peaks on the EC side than on the replication side — much smoother CPU utilization doing replication PUTs or GETs than on the EC side. But overall, not a huge tax on the system. So, more good news.

And again, more good news here, because this was expected — and we like things that come out the way we predict them. This is our memory utilization at the proxy — at either proxy. Wow, looking good. What the graph is showing is free, or available, memory, and you can see that before the EC runs start over here, we've got about, I don't know, 60 gigs of memory available at our proxy — both our proxies have 64 gigs of memory in them. As the tests progress we end up chewing up tons and tons of memory. We get to around the 64-megabyte section of the tests and you can see we're eating up quite a bit of memory, and when we finish the EC tests and move on to the replication tests, all that memory is freed up. This is all, as I mentioned earlier, buffering — segment-size buffering.
So if you're doing triple replication of a 64-megabyte object, the only thing you're buffering in proxy memory is those chunk buffers I mentioned earlier. But in EC we have to buffer an entire segment, which in our case is one megabyte. So for every object that's coming in, we're going to chew up a megabyte of memory while that object is in flight. That's what this reflects — as compared to just the network chunk size, which by default in Swift is 64k. So that's the difference in what is actually held in RAM in the proxy server, and it's exactly what you see: the replication run buffers only 64k at a time for each request, whereas the erasure code run buffers up to a megabyte at a time for every individual request, which perfectly explains the memory utilization increase you see with erasure codes.
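To put rough numbers on that buffering difference — this is just our own arithmetic to illustrate the point, with an assumed concurrency, not anything measured from these runs:

    # Proxy buffer footprint per in-flight request, per the description above.
    SEGMENT_SIZE = 1 * 1024 * 1024   # EC: a full segment buffered per request
    CHUNK_SIZE   = 64 * 1024         # replication: one network chunk per request

    concurrent_puts = 1000           # assumed number of in-flight requests

    ec_bytes  = concurrent_puts * SEGMENT_SIZE
    rep_bytes = concurrent_puts * CHUNK_SIZE

    print(f"EC:          ~{ec_bytes  / 2**20:.0f} MB of proxy buffers")
    print(f"Replication: ~{rep_bytes / 2**20:.0f} MB of proxy buffers")
    # ~1000 MB versus about 60 MB: a 16x difference at the same concurrency,
    # which is why the EC runs eat proxy memory so much faster.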
Okay, next — and by the way, feel free to stop me if you have any questions; we'll take them as we go, or we can hit them at the end.

So now let's look at CPU utilization down on the storage nodes. On the storage node we did have a lot of changes, primarily in the PUT path, that really don't have anything to do with EC itself — they just have to do with how the implementation came out, and with getting the data down and making sure we have all the integrity checks we need in place. So there's a lot more going on down there than there is in replication. However, in these tests none of it is actual EC math. All of the math is done in the proxies — that's the design. On ingest, we do the parity calculation in the proxy, and the storage nodes just accept blobs of stuff. They don't know what it is; they just put down what they're told. On the GET side, the storage nodes just spit back their blobs, and the proxy is responsible for doing the math and reassembling the object. The only time a storage node gets involved in erasure code operations specifically is when it's reconstructing data — because of a rebalance, or data loss, or a hard drive loss, whatever — and that wasn't going on in this picture. So what you're seeing here is roughly equivalent CPU utilization — a little choppier over here on the EC side; again, like I said, we did make some changes in how objects are stored, but that was more housekeeping because of how Swift works, as opposed to erasure-code-specific — versus replication over here. Nothing really significant there, which is good, because we didn't expect there to be.

Okay, this one was a little unexpected — "yikes" is the word. You might think this is a picture of the proxy node based on what we just described, but it's not: this is a picture of a storage node. Just one of the storage nodes, but they all look like this. What's happening is that during our EC runs we're chewing up a pretty significant amount of memory — you can see, like 20 gigs of memory, I think it's 19 gigs — throughout a variety of sizes and operations, whether they're PUTs, GETs, or a mix. And then when we move into the replication runs, you can see a significant amount of that memory become available again. We're still investigating this. We've got some ideas, of course, because we wrote the code, but there's nothing that stands out as "oh yeah, we forgot to do such-and-such, and that's why we're using up all this memory." So this is something we will be addressing soon, or at least understanding and documenting, so that folks purchasing equipment and deploying for EC know what they need. We already know that on the proxy you're going to need more memory; on the storage nodes, if you deploy today, you need more memory too — maybe tomorrow you won't.

So that brings us to: okay, now what? We've looked a little at the CPU and memory utilization on the proxy server and on the storage nodes, and how that's characterized and what it actually means; and you understand a little, from the end-user request side, what it generally means. We know that in general, erasure codes are going to be better for large objects and replicas are going to be better for small objects, and there are obviously tons of different knobs we can tune and things you'll want to take advantage of. But that's the high-level picture. So what are some of the use cases where we already know people are interested in and asking for erasure codes — why did we actually build this, and who wants it?

There are a few things I would suggest are really, really good use cases for erasure codes. Backups: fantastic use case. Video storage, because of the large data files. And I know some of the biotech industry is using Swift to store lots of genomic data and things like that. The point is you've got these large data sets — this WORM-like data: write once, read many. You don't really overwrite your data, and you don't go out and delete it; you're not constantly pushing massive amounts of new data in, but you've got a big data set that you may need to read — or, in the case of backups, that you hope you'll never have to read. Those kinds of large chunks of data are really great for erasure codes. If you want to store backups, then configuring a storage policy for erasure codes and putting all of your backups into a container configured with that storage policy is a great idea.

And one other thing: obviously the common denominator here is large objects, but that's us making an assumption that the performance has to be the same when you switch from triple replication to EC. There is a price/performance balance, right? It's not the case that every usage model says "I have to match the performance Swift happens to hit with triple replication today." If you're willing to pay 20% in performance to save 50% on capex and opex, then you might use EC for smaller objects — I'm not saying 4k like we saw, but don't walk away thinking EC is useless unless you've got 64-megabyte objects. Not the case. It depends on what you're willing to pay, and that price/performance balance.
It's a really good point. So, that being said, does that mean we should go for erasure codes all the time? If it's going to save us some money — because hard drives are what really dominates the cost of your cluster — should we use erasure codes for everything? And I would say no, don't do that. Swift is still really, really good at replication, and there are a ton of use cases for it. This is not in any way a replacement for replicated storage inside of Swift. This gives people a new ability to do new things with Swift, and to be more efficient at a few of the things people have already been using Swift for. We know Swift is used all over the world; we know people are using it for document management, online content, media storage, video streaming, games, CDNs, data processing, mobile content — all of that kind of stuff. But one very, very important thing: you should absolutely keep erasure codes inside a single geographic region, and use replicas if you've got multi-site requirements, which I know many Swift deployers do. People think about it and say, well, what I really want is the advantages of erasure codes, but I also want it distributed globally. Not yet — don't do that yet. One thing at a time; walking before we run, right? But we will get there. Those are things people are asking about and are very interested in.

So again, it becomes that price/performance balance. Replicas are going to be really good if you have requirements for a particular level of performance and you can pay a little more in hard drives, or if you have very small objects — those kinds of things work a little better with replicas. If you need to really save money on the hard drives, and you can pay a little more in the extra CPU and memory requirements, and you don't have a lot of those really, really small objects, then EC is a really good consideration.

So, that being said, where are we going from here? We're certainly not saying that EC is done, in just the same way we're not saying that Swift is done. We continually work on it every day, and we continue to improve it. So what are those things going to be? The biggest one that jumps out is the small-file issue. I know there are people out there who say, "I can't do erasure codes right now, because I really can't control how my users are putting things into the cluster" — especially in the public service provider sort of model. Segmenting your data so that large data goes to erasure codes is really great when you can actually control the applications that are sending the data into your cluster. But what if you can't? In that case it's kind of hard to use erasure codes, because you're going to configure it, add the extra capacity, expect the cost savings — and people are going to put in a billion one-byte objects, and you're going to think that wasn't a good idea. So in that sense, I think one of the biggest obvious things is to offer, inside the erasure code support — or inside Swift in general — a better, automatic way to handle small files, so that when you put small files in, we're not trying to do erasure coding with one-megabyte segments when we have 4k of data. Yeah, and I think the key words there are automatic and transparent, right?
Right, right. So we're not short on ideas — we've got lots of things we could do — but the key goal there is that the application shouldn't have to care. We want to give them better performance on small objects through some mechanism behind the scenes. Yeah, and another thing that shows up, especially if you're looking at those memory graphs, is: what's going on with that memory usage? We've got to figure out exactly where that lines up, confirm it with other tests and other clusters with different configurations, find the root cause, and make sure it goes away. But in general, that's no different from what we do for anything, including all of the replicated storage today. The point is, going forward, the future work is: as we find bugs, we fix them, and we have ideas on how to make things better and improve those sorts of things — but that's not at all exclusive to erasure codes. Overall, we've got to continue to make Swift better.

What's really great about that — I'm going to come back to it. I think the last part to talk about as future work is that there have obviously been a lot of conversations in the community along the lines of: now I've got erasure codes and I've got replicas — how can I automatically move data from one to the other based on some policy, like how hot the data is, or how old the data is, or something like that? And yes, we hear you. There are huge numbers of people saying we need this data tiering, that we need to solve policy migration and management. People are working on those things, and I think you will see that work land inside of Swift over the years to come. So that's where we are.

But to back up just a little bit: the way I'd characterize it is that we're going to keep working on erasure codes, we're going to keep working on replicas, we're going to keep working on adding features supported by both, and as we find bugs, we're going to fix them. So that leads to the question: should you use erasure codes? Should you actually go do this? And I think the answer is yes, you should. In my mind, there's no bright line where you can say "is it production-ready or is it not?" — there's too much gray area in the middle. The way I look at it is: how are we as a community managing and treating this particular feature? Do we know, based on all of our testing, that it works? Yes. Do we know that it's going to durably store your data? Yes. Do we know that it handles all of the edge cases we know come up? Yes — it handles failures, it handles capacity adjustments, it handles all of that. And there are no known critical issues in this data path of the "well, you might lose data" variety —
that's not the case; we don't know of anything like that. And that puts it in the same category as replicas: we will continue to develop it, we will continue to improve things, and going forward I would strongly say that yes, you should consider using erasure codes. Like with any storage system, I'd suggest you try it out before you dump three petabytes of production data into it — but yes, I think it's absolutely something you should be looking at today. And again, think about where you're going to use it, in those price/performance terms, especially with the large files and that kind of thing. So, that being said, I think we have a few minutes left for questions, and we have a mic coming around.

Yeah, while we're waiting on the mic, we should add, on the community side of things: initially, like any large effort, this started off with a few people working on it and slowly grew, and I'd say since we moved to master, months and months ago, the number of people getting involved and helping fix things and make things better has grown just unbelievably — new people from IBM and HP and all over the place jumping in. So to that same point — is it treated like production code? It absolutely is. There's a large number of people in the community who know how to get in and make things better. It's fantastic. If you'd like a copy of these slides, the link is at the bottom of the screen.

This is really interesting — I'm just curious if there's any difference in the data durability guarantee, because with erasure codes you're splitting it up, and maybe the original data isn't there, right, if only one block is written?

Yes, there are absolutely differences in the data durability guarantees, and generally it's better with erasure codes. We've actually got some tools for that. The reason is — it might seem like, wait a minute, we're splitting it up, we don't know exactly where everything is, and I only have these pieces — but the way it works, and I can't really go into a lot of detail because we just don't have the time, is that it's split up such that, if you're doing a 10+4 like we were talking about earlier, you can lose any four pieces. Whereas if you're doing replicas, you can only lose any two pieces, because you still have to have one copy of your data someplace — so with EC we're able to reconstruct from what's left. Yeah, and again, we've got a durability calculator — I'm not sure of the URL off the top of my head — but you can go in and punch in your erasure code policies versus replication, and it'll give you an actual number for what your durability looks like.

Actually, I understand the fact that you've got the extra four parity fragments; my question was more about when you do the PUT — guaranteeing that the data is there before you get back your 200 status, right?

Right. So right now with Swift, in replication, we will not return a successful response to a write until we know that it has been durably flushed down to disk for at least a quorum, which is more than half of the number of replicas. With erasure codes we're doing something very similar, in that we make sure it has been durably flushed all the way to the drive, in the same way, for a quorum — which in the EC case I believe is the number of data fragments plus one.
Yeah, we changed it three times, but yeah. So the point is, you have the same basic characteristics: in a complete worst-case scenario, after you get a successful response — in triple replication or in erasure codes — the cluster is still able to withstand a hardware loss and you still have your data. Guaranteed, or your money back.

So, have you done any characterization of reconstruction performance at all?

Yes — but we don't have any slides on that. We were focused on just the PUT-and-GET path. That's a big area for us moving forward: to continue to look at that and understand what's tunable. Really, the framework there was carried over from replication, and some of the tunables may not make as much sense based on how we wrote the code, so we've still got some work to do there. Okay — there's a question in the back.

Just a quick question about the performance: I didn't see the number of threads you used, and I know that has a big impact in general on performance. Have you tried different numbers of threads to see what gives the highest performance?

In the proxy and object server, or in workload generation? In the workload generation. In workload generation we pretty much ran with everything. I think some of the early numbers you saw were run with pretty low concurrency — 100 threads — and we were able to saturate 10 gig. In the resource utilization tests there was a mix of 512 versus 1,024 versus 8,000, so not all of them are reflected in the slides, just because, again, it would take forever to go through that much data. But yeah, we went all the way up to 8,000. Thank you.

Did you guys measure time-to-first-byte performance?

I know we do measure it; we might have it afterward. I would have expected erasure coding to be a little slower, right — was that the case? Doug, did you measure that? Okay, so to repeat that for the video and for everybody in the room: Doug, who works at SwiftStack and did a lot of this erasure code testing — thank you — said the results were that for larger objects, the time to first byte on reads with erasure codes was actually smaller than for replicas, which lines up with the overall throughput being better as well. For smaller objects, replicas were better.

So do you have some ideas on how you're going to solve your small object problem? I'm sorry — the small object problem? What kind of ideas are you batting around? So there are a few things we can do, and like Paul said, there's no shortage of ideas, but we haven't really converged on saying this is the way we're going to do it. Some of the things we talked about pretty early on were: okay, instead of trying to actually erasure code this, let's just do something simple — if it's a 1k object,
let's just store it 14 times, and that'll be fine — the performance benefit of that would outweigh the extra cost of storing 14k versus the 3k you'd store with triple replication. Another thing you could do is figure out the object size at a slightly higher level and shunt it into a particular policy or not — that sounds a little more complicated, kind of a hard-tiering sort of thing. Yeah, and one of the really hard problems with this is that many times, when Swift receives a request, we don't actually know how big the object is going to be, because a client can simply start streaming data with chunked transfer encoding, without a content length, and the data is done when they stop sending. So there are some things we could do — for example, we could say that if we read the entire body within one segment size, great, maybe we'll just replicate that out and not erasure code it; but if it spans more than one segment, then we'll go ahead and erasure code it. That actually makes a lot of sense given where you start to see those crossovers in the graphs. In that case, if you're using the defaults, anything under one megabyte would just be replicated out, and anything over one megabyte would be erasure coded, and you could change that threshold via the segment size. So that's one idea — I don't know if it's a particularly good idea, but it is an idea, and I'm sure there are lots of other things we'll figure out as a community.

So again, to echo what Paul said, we've had a fantastic community around this. I see many of you in here right now who have really helped out, so I wanted to say thank you to everyone who has helped us test and helped us write this over the last few years. The Swift community is really great. If you would like a Swift shirt, you can get one in the marketplace today at the SwiftStack booth — many people have them on, they're pretty awesome — and if you contribute code, you get one for free. So, thank you very much.