All right, welcome everybody. Thanks for coming. Boy, tough act to follow, watching Spaceballs right into a Swift erasure code talk. All right, so I'm Paul Luse. I'm a software engineer at Intel. I've been there a little over 20 years, 21 years; I forgot a year there for a second. I'm focusing mainly on storage and storage products. For the last year and a half or so, I've been pretty heavily involved in the Swift project, working on storage policies and now erasure codes, and I'll let Kevin introduce himself and then we'll get started. So I'm Kevin Greenan, I'm an engineer at Box. What brings me to the Swift community is specifically erasure codes. I've spent, geez, maybe the last eight or nine years working in erasure codes, a bunch of that in research, and I contribute to a handful of open source projects around erasure codes. And one of the things that I wanted to do was basically provide a Python library so Swift can take advantage of some of the lower level C libraries. So a lot of where I come in is basically figuring out what that library needed to look like, and also helping the Swift community figure out what the architecture should look like. But for the most part, I'm kind of the library guy. So what I'll be going through today is kind of a semi-tutorial on erasure codes and just a quick little advertisement for some of the libraries. Okay, so we don't have a remote clicker working, so Kevin is going to be my clicker and I'll be his. So this should be fun because everything is animated here; we'll see how in sync we can stay. Go ahead, Kevin. Okay, so the agenda. The good news is we've got a ton of really cool material, and the bad news is we've got a ton of really cool material. It'd be tough to get through all of it, but we are going to do our best, and we'll certainly take questions and answers either during the talk or at the end. But the way we're going to proceed here is I'm going to run through an overview of Swift to put everybody on a level playing field. I mean, a super brief overview; there's plenty of other talks this week and resources if you want to learn more about Swift. I'm going to focus a little bit on storage policies, and you'll see why, because they're the fundamental building block that got us the ability to do erasure codes. And then Kevin's going to go through the erasure code section, and this is really where the title comes from, taking the mystery out of erasure codes. You're going to learn a lot more about erasure codes than maybe you wanted to know, but some really interesting stuff from Kevin. And then I'll wrap up with an overview of what the implementation looks like in Swift at a high level. We've got a design session on Friday; if anyone wants to find out a little bit more about the details, what we're doing, when we're going to have it done, all that kind of stuff, we'll cover that at the end. Okay, so Swift. Everybody knows Swift, I'm assuming. It's one of the first two core projects in OpenStack. Truly a community project. It's 100% Python, and the reason I mention that, even though I think everybody probably already knows it, is because we have had a few questions, as we've been working on this, about the performance of Python with erasure codes. So Kevin will talk about that in his talk. The actual erasure code implementation, the math, is not in Python. We're actually relying on one of the external libraries that Kevin and some of the other community members have put together. Go ahead, Kevin.
Really good, vibrant community. It's been just a pleasure working with these folks. Kevin and I are really representing the work of a lot of people on the erasure code effort. It's not simply the two of us. A lot of folks from SwiftStack, a couple of my colleagues from Intel, from Red Hat, IBM, HP, Rackspace. The list goes on and on. Everybody's really contributing to this. It's been a fantastic project. And then another common question, besides the Python versus C thing, is related to this timeline, and why I put it up here. You can see we started talking about erasure codes a long time ago. And I've been asked a couple of times, geez, what the hell is taking so long? You guys are still talking about erasure codes? Well, this timeline, I won't walk through the whole thing, but you can see in sort of the middle, we took a diversion from our erasure code efforts, which we started talking about back in the Havana time frame, for this little thing called storage policies, and that ended up taking a good eight or nine months of lots of collaboration and lots of effort as a prerequisite to get the EC work done. So it's only been the last few months that we've really been able to refocus resources on the actual erasure code implementation. Okay, so some of the key points that you need to understand about Swift, to really get what we're talking about with policies and erasure codes, is what I'm gonna cover here on the next couple of slides. Again, it's not intended to be a Swift 101 type class. There's way too much to get into for the short amount of time we have. But one of the key points to understand is the Swift container model, right? Everything is grouped by container, so common attributes of the objects are associated with a container. I think everybody also knows the RESTful interface, right? There's nothing that we're doing differently or new around erasure codes or policies that really changes the API. And it's built upon standard hardware and highly scalable, right? Also eventually consistent, right? With a focus on availability and partition tolerance, right? So those that are familiar with the CAP theorem can really appreciate what we're doing there. So also, what Swift is not is kind of a good thing to point out, as people sort of try to figure out exactly what this is, especially in this day and age of unified storage systems and software defined storage and all that kind of stuff. Swift did not set out to be everything to everybody. It is a very focused project with a very focused goal of object storage in an eventually consistent model. So it's not a distributed file system. It is not a relational database. It is not a NoSQL data store. And it is not a block storage system, nor does it want to be any of these things, right? It is an object storage system set out to be the very best object storage that it can be. Okay, so let's jump into the really sort of 50,000 foot view of the software architecture. And I'm putting these slides up here to give you sort of an idea of what the components look like, the breakdown, and then at the end I'll sort of touch on what we're touching and what we're modifying to get erasure codes bolted into this thing efficiently. So starting at the top here, we've got the proxy layer. And at the proxy layer, at the proxy servers, you can see the sort of bluish colored stuff is the non-Swift core stuff. Everything is a WSGI model. And then the WSGI application running in the proxy server, you can see up there, is our proxy application.
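As a rough illustration of what "everything is a WSGI model" means: the proxy, account, container and object servers are all WSGI applications. Here is a minimal, generic WSGI callable, purely illustrative and not Swift's actual code, just to show the shape of the interface each of those servers implements.

```python
# A bare-bones WSGI application, illustrative only. Swift's proxy, account,
# container and object servers all implement this same callable interface,
# with much richer request routing and middleware pipelines around it.
def simple_app(environ, start_response):
    # environ carries the parsed request: method, path, headers, body stream.
    path = environ.get('PATH_INFO', '/')
    body = ('You requested %s\n' % path).encode('utf-8')
    # The application picks a status and response headers...
    start_response('200 OK', [('Content-Type', 'text/plain'),
                              ('Content-Length', str(len(body)))])
    # ...and returns an iterable of byte strings for the response body.
    return [body]

if __name__ == '__main__':
    # Serve it with the reference WSGI server from the standard library.
    from wsgiref.simple_server import make_server
    make_server('127.0.0.1', 8080, simple_app).serve_forever()
```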
And the proxy application is really broken down, there's a few more classes than this, but these are obviously the big primary ones: an object controller, an account controller, and a container controller. And each one of these has probably fairly obvious responsibilities, to manage the type of entity it's named for. And then there's a series of helper functions and some other things. But that's kind of what the proxy looks like at a high level. Now down here on the bottom we've got the storage nodes. And there are multiple applications that run on the storage node, and there's different ways you can slice and dice and distribute these things. We sometimes call these servers, but I've got them listed here just as WSGI applications. You can see the first one is the Swift object WSGI application, which is, as you might guess, responsible for the management and placement of objects. Some of the main classes in there are the replicator, the auditor, the expirer, and the updater. Then moving on down the list, the account application has responsibilities very similar to what the object one does, but for account databases. So it has a replicator, an auditor, a reaper and some helper functions. And then on down to the container application, with very similar functions to what the account application runs. Okay, so now I'm gonna jump into storage policies a little bit. This is actually a slide from a talk that John Dickinson and I gave in Atlanta. So if you're curious about more of the details of storage policies, what we did and why, that's still online. I encourage you to go Google it; you'll find it pretty quick and see what it's about. But I wanted to explain sort of why we did it, because it wasn't just for erasure codes, right? It really fundamentally buys us a lot more than erasure codes, but it was a prerequisite for doing the EC work. So why did we do this? Well, without storage policies, for one thing, you could pick a different replication level for your cluster, but it applied to the whole cluster. So if you had needs to do lower durability for some applications and higher durability for others, you pretty much had to deploy multiple clusters if you wanted to do that before policies. Another thing that we added storage policies for was to provide the ability to differentiate different hardware or software services within your cluster and bubble them up so the application can see them and take advantage of them, right? So if you have certain nodes in your cluster that have special capabilities or higher performance, without the policies work those were kind of lost in the fray, right? It was difficult to really exploit them. And then finally, the one most relevant to this talk: no real easy extensibility framework for adding things like erasure codes, right? So Swift fundamentally, really everything around Swift, is built around replication and making copies of the same object and the assumption that the same object is everywhere. And that's all very, very different with erasure codes. So we needed some way to compartmentalize and to extend and to do something different, right? Beyond just making copies of things. So one really big fundamental thing in understanding how policies work, and therefore how erasure code support is bolted in, is to have some better understanding of what the ring is. And the ring is, well, I'm not gonna do it justice in one slide, but I did wanna give folks an idea of kind of how it works, what it is. So we've got a picture of a ring here.
This is partitioned into what we call partitions, not to be confused with disk partitions, right? These are sections of the MD5 namespace. The ring is actually a data structure that's isolated, separate, and maintained outside of the operation of the Swift cluster, right? So there's a set of tools provided that you use to build this data structure and then distribute it amongst the nodes in the cluster, so everybody has a common view of where things are supposed to go and what the rules are for placement. But the way it works here, as you can see, we feed in a URL, the complete URL of the object or the account or container that you're trying to reference. We feed that into an MD5 hash and that spits out an index that we use to get to a particular partition in the ring. And the ring is effectively this big, fairly simple data structure with two tables and a couple of other miscellaneous elements. But the tables are actually pretty easy to understand. There's one table over here that's simply this long list of devices. And by devices, I mean hard drives, right? So every hard drive in the cluster is represented in this table. And then we've got this other table with a really hard to pronounce name that's really sort of the magic of the mapping. So go ahead and click again there, Kevin. So when we get that partition index out of the hash, we use it to look up in this first table, and that gives us a row in the table. And from that row, click again, maybe I should just hit the button myself so you can see me hitting it, that might work better. From that, we get a list of indices into our table of devices. And that's how we identify, from an object name, which hard drives are the relevant ones, right? Regardless of where they're at in the cluster, because they all have IP addresses associated with them that we go out and talk to. So this is, like I said, the brief overview of it, and if we click one more time, you'll see the real magic here is what you put in this table, right? And there's, like I said, a whole set of tools that defines what goes into that table, which has to do with the regions and zones and making sure that your fault tolerance is the way you want it and that you don't have a single point of failure, right? All that kind of good stuff. All of that is really separate from the basic concept of doing an MD5 hash to look up an index, right? That's where the real magic happens, right? And all of that happens, like I said, offline with some tools that build this ring for you. Oh, excellent, thank you. So what are policies then? At the very simplest description, policies are multiple rings for objects, okay? So let's click a couple of times here and see if we can speed some of this up. So you can see, here's some examples, right? Here's a ring for triple replication, and it has, as you might expect, three locations. And when you get three locations, that means three copies. Here's another replicated ring, reduced replication, with two locations. So you get two copies, right? That's what those things mean. Now, here's where we get the semantic change in what a location on the ring means: we introduce erasure codes. Now we've got n locations, and they no longer represent copies of the data. They now represent locations for fragments of the data. So we actually get to leverage, so far, 100% of the ring code. We haven't touched any of it to do the EC work, right? There's a rough sketch of that lookup just below.
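To make that flow concrete, the container's policy picks a ring, the MD5 of the object path picks a partition, and the partition maps to a list of devices. Here is a minimal sketch of the idea; the names, table layout, and bit counts are made up for illustration, and this is not Swift's actual ring code.

```python
import hashlib

# Toy ring for one storage policy. Real rings are built offline by the
# ring-builder tools and distributed to every node; this only shows the shape.
PART_POWER = 4  # 2**4 = 16 partitions (real clusters use far more)

devices = [  # every disk in the cluster, with enough info to reach it
    {'id': 0, 'ip': '10.0.0.1', 'device': 'sdb'},
    {'id': 1, 'ip': '10.0.0.2', 'device': 'sdb'},
    {'id': 2, 'ip': '10.0.0.3', 'device': 'sdb'},
    {'id': 3, 'ip': '10.0.0.4', 'device': 'sdb'},
]

# One row per replica (or per EC fragment), one column per partition.
# For a 3-replica policy the three entries for a partition mean three copies;
# for an EC policy each entry is just the home for one fragment.
replica2part2dev = [
    [0, 1, 2, 3] * 4,   # replica / fragment index 0
    [1, 2, 3, 0] * 4,   # replica / fragment index 1
    [2, 3, 0, 1] * 4,   # replica / fragment index 2
]

def get_nodes(account, container, obj):
    path = '/%s/%s/%s' % (account, container, obj)
    # MD5 of the full path, keeping the top PART_POWER bits as the partition.
    digest = hashlib.md5(path.encode('utf-8')).digest()
    part = int.from_bytes(digest[:4], 'big') >> (32 - PART_POWER)
    # Each replica row tells us which device holds that copy or fragment.
    return part, [devices[row[part]] for row in replica2part2dev]

part, nodes = get_nodes('AUTH_test', 'photos', 'cat.jpg')
print(part, [n['ip'] for n in nodes])
```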
And we get away with that because we're just changing the definition of what it means to get an object location back from the ring. The second big piece of policies, besides introducing the ability to have multiple rings whose locations mean something different semantically depending on the policy, is one new piece of metadata that is applied to the container. So that's why I had the container as one of the key points to understand in this super quick Swift overview. This one new piece of metadata tells the rest of the classes in the code which object ring to use for all of the objects within that container. So now you can, and since this is out, I don't have to say "soon you will be able to." You actually can today, right? This was released in the summer as part of the Juno release. So you can have one container where everything you put in it is only double replicated, and the application just specifies a different metadata tag, writes to a different container, and that one's just magically triple replicated. And then as soon as we get done with the EC work, the same thing goes there. So it's very easy for an application to switch durability policies. In fact, once the container's created, it's completely transparent; you don't have to do anything. So if you're doing performance work, for example, and you wanna see what's the performance difference between triple replication and EC, you use your performance tool, create one container that's triple replicated and one that's EC, run your tests, then just write to the other container and rerun your tests. And you've just changed from triple replication to EC just that easily. Okay, and then sort of putting it all together, and we'll see another slide like this at the end where you'll see how the EC stuff works. This is just a simple triple replication example, also a little more sort of Swift 101 context here. Let's go ahead and click through the animation here. You can see this is an example of a PUT. So an object comes through, hits the proxy. The proxy uses the object ring to look up the three locations, finds the three locations, and sends the three copies out to the storage nodes. And then when the application wants to get it back, the proxy does a lookup. It picks one of those three locations; there's a couple of different rules we have for how that works depending on what the use case is. But it will retrieve one of those copies and shoot it back up to the application. Okay, so I'm gonna turn it over to Kevin now and we'll go through the erasure code tutorial piece, and then we'll get back to Swift again. All right, I'm gonna be my own clicker. Oh, really? I'll just make it easier. So you can just get out of the way. Yeah. All right, so like I said, I'm kind of the token library guy on a few other projects as well, but me and another guy, Tushar, who also works on some of the Swift stuff, basically created the Python library that Swift is gonna be using for erasure coding. And like Paul had said, the only Python piece of the library is basically the Python C API bindings. As I'll show later, the library is actually pluggable, and we support a few C libraries today. One's Jerasure, which is one I also work on. Another's ISA-L, which is Intel's accelerated library. And it has support so that if you wanna plug in your favorite C library, you can use it. So basically what I'm gonna do is go and actually give some background on erasure codes, just quick background.
And actually show an example, encoding and decoding of what is basically the most ubiquitous erasure code out there today for storage, which is Reed-Solomon. And one of the goals I had for this was to sort of crack open the black box a little bit. And if I get through to like two people, I think it would be a success for me, because I don't wanna scare anyone away, but I'm gonna go through some math and I hope that it's at least presented in a way that's understandable. And for those of you who aren't super familiar with erasure codes, I also have some higher level stuff. So hopefully everyone kind of walks away from this with something. So after I go through Reed-Solomon: another big topic today in storage and erasure codes is minimizing reconstruction cost, and there's interesting classes of codes that actually let you do this that aren't necessarily Reed-Solomon. So I'm gonna go through a pretty simple example using an XOR-only code to give you a taste of the kinds of things you can do to minimize reconstruction costs. And at the end I'm gonna quickly go through a few of the libraries, the two libraries Swift is using to implement erasure codes. So erasure coding's not a new concept. It's been around for a really long time. Back in the 60s, a couple of guys, Reed and Solomon, came up with Reed-Solomon. It was primarily used for network reliability. And then eventually it ended up being used on mass storage devices such as disk drives and CDs. It really wasn't till the late 80s and early 90s, when RAID really became a bigger thing, that erasure codes started to be used more across multiple disks. So RAID 5 is an example of a scheme that uses Reed-Solomon; RAID 6 is the same. And then there's a bunch of other codes, other than Reed-Solomon, that are used for RAID 6. And more recently, I'd say in the 2000s and actually up until today, there's been a lot of focus on the so-called regenerating codes, and these are the codes that are really good at minimizing reconstruction costs. So in a setting where you have thousands of disks and disk reconstructions are kind of the norm, one of the primary goals of these codes is to minimize the amount of traffic you have to send over the network to repair a failed drive. Reed-Solomon is kind of the worst case for that, because it can get pretty expensive when you're reconstructing stuff. So I figured I'd start with terminology, and hopefully this is the kind of thing that would be useful to people that maybe aren't as familiar with erasure codes. Most of these terms are actually lifted from some work Jim Hafner did at IBM in the mid 2000s, I think it was. A couple of his papers did a really good job of just sort of laying out the terminology. So I figured I'd just go through this really quick here. So what does an erasure code do? Basically, all it really does is take an object, chop it up into k pieces, so that's where we're using k here, and put it through a function, and n things are gonna come out, where n is greater than k. And the m extra things, the n minus k of them, are gonna be what most people call the parity. Does that make sense to everyone? And feel free to stop me at any point if you need a clarification or you're confused about something. I wanna make sure everyone walks away with something from this. There's two main types of erasure codes, systematic and non-systematic. Systematic basically means the encoded output symbols, those n things, contain the original data.
So if I use a systematic code, if I can get the k data pieces back, I don't need to do any decoding, I can just read it. On the other side, there's non-systematic codes, and that is when the encoded output does not contain the input symbols. So in that case, when you get k, or however many you need to gather, you need to actually put it through a decoder to get the original data back. Typically in storage there's a few examples of using non-systematic codes, but in general, we use systematic codes. Most RAID schemes use systematic codes, just because if you can get the original data back directly, your work's kinda done. What we call a code word is a set of related data and parity, and these are related via a set of parity equations. So as an example, I have a function, I'm gonna feed it the data, and it's gonna spit out a code word. And in the systematic case, I get data chunks and parity chunks. There's also layout. So flat, horizontal basically means each encoded symbol is mapped to a single device. And this is, again, typical of traditional RAID geometries. Well, I guess actually not necessarily; in some cases it's typical. There's things called array codes that have both horizontal and vertical layouts. So if you've heard of something called EVENODD, out of IBM, or RDP, out of NetApp, those actually have both horizontal and vertical layout, which means many symbols from the code word can map to a single device. And finally, there's MDS and non-MDS codes. MDS means maximum distance separable, and basically all that really means is that of the n chunks, all it takes is any k to recover the original file. With non-MDS codes, k chunks might not be sufficient to do that, so you actually might need a little bit more to recover the original file. The one thing you get from non-MDS codes is, like I was saying before, you can optimize for reconstruction. So it might be more expensive to recover the entire file, but it's actually cheaper to recover a single chunk. And I'll go through an example later that sort of shows that. I'm gonna skip over this to save on time, because it looks like we're a little bit over, but there's many variations of codes. So I'm gonna walk through a simple example using Reed-Solomon, systematic, horizontal layout. So in this example, we have n is eight, so we have eight total chunks that we'd be writing out. We have k data, sorry, five data, and three parity. So in this case, the nice property of Reed-Solomon is that it can tolerate the loss of any m things, so any n minus k things. And the overhead is actually also optimal; the overhead is just gonna be n over k. Do we have any questions? Is this like, does everyone already know all this stuff, and I'm just going through stuff everybody knows? No? Yes? Okay, awesome. So just a quick example, let's pretend these are disks, and let's pretend that we've sort of broken up the disks into bite-sized elements. So there's gonna be many more than this. So just to kind of let some of the terminology set in, a code word is, like I said before, a related set of data and parities. Okay, so here's where things are gonna get real. There's gonna be two slides of math, so I hope nobody starts walking out, like, screw you, I don't wanna go through this. And the main reason I do this is, usually erasure codes are treated as a black box: you just feed stuff in and you get stuff out and everyone's happy.
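That black-box view is roughly what Swift gets from the Python library Kevin mentioned, PyECLib. As a sketch of what the five-data, three-parity example looks like from Python: this assumes PyECLib and at least one C back end are installed, the class and parameter names reflect PyECLib's public interface rather than anything shown in the talk, and the exact ec_type strings depend on which back ends you have.

```python
# A sketch of black-box usage, assuming PyECLib with the Jerasure back end
# installed; ec_type names vary with the back ends available on your system.
from pyeclib.ec_iface import ECDriver

driver = ECDriver(k=5, m=3, ec_type='jerasure_rs_vand')

data = b'some object body' * 1000

# Encode: returns n = k + m fragments (each carries a small PyECLib header).
fragments = driver.encode(data)
assert len(fragments) == 8

# Lose any three fragments; Reed-Solomon with m = 3 tolerates that.
survivors = fragments[1:4] + fragments[6:]   # any 5 of the 8 will do

# Decode from any k fragments to get the original object back.
assert driver.decode(survivors) == data
```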
Now, I don't really like treating things as black boxes; I really like to understand what's going on behind the scenes, so I'm just gonna try to do this with as simple an example as I can come up with. So in its purest form, Reed-Solomon is basically just oversampling a polynomial. So basically, in the example I was describing before, since we have n total things, you're oversampling a single polynomial n times. Does everyone remember what a polynomial is? I hope I remember what a polynomial is. Okay, awesome. So in the case of the polynomial we're oversampling, the coefficients are the data, right? And I'm gonna sample this polynomial at n points. So we're gonna go ahead and map this over to a matrix, and I hope I still kinda have everyone here, where each row is an evaluation of the polynomial. So in this case I have five data and n total things. So this is a non-systematic Reed-Solomon. So we need to map this over to a different matrix to make it systematic, so that we actually have all of the original symbols in the output. Another thing I should mention is that a property of this matrix is that any k by k submatrix of it is invertible, and that becomes important during the decoding step. So if you remember from linear algebra, if you took linear algebra, I can take this matrix and do elementary operations, and the result's gonna have the same rank. So that basically gives us the same thing as what I said before: any k by k submatrix is invertible. So I can transform this matrix into that one, and now I have a systematic Reed-Solomon code. And I'll map this into the encoding example so it'll make more sense. Is everyone still holding on here? Okay, so let's go through the encoding process. So we have our systematic matrix here, and that vector there is our original data, and all we're doing is multiplying our original data by this matrix to get our code word. And as you see here, when I do the matrix-vector multiplication, I multiply that vector by that first row, and I just get D zero. So I get the data in the clear. That's basically what makes it systematic. A few things to note: all operations are done in a Galois field. I'm not gonna go through that today; I think it's well beyond the scope of this talk, and it's even more mathy than any of this stuff. And I guess here's another thing: if anyone wants to talk about any of this stuff at a whiteboard during the week, just grab me, I'd love to talk about it if you do want to know. And like I said before, with the systematic matrix, any k by k submatrix is invertible, and that's going to become important when we decode. So suppose we lose three drives. So in our code word, we've lost those three things. How would we decode to be able to get the original data back? Well, like I was saying before, any k by k submatrix is invertible, so what we can do is remove those three rows from the matrix. The result's gonna be a five by five matrix. I invert it, because it's invertible, and if you remember again from linear algebra, if you multiply the inverse by the original matrix, you get the identity. So basically what we do is take the remaining elements, multiply them by the inverse, and that's gonna give us the data back, because the inverse times the matrix is the identity. Does that make sense? All right, so that's what we just did there, right? Boom, decoded. And yeah, I don't have to go into too much more detail on that. Okay, so now we have an example of how Reed-Solomon works; there's a little runnable sketch of it below if you want to play with the matrices yourself.
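Here is a small numerical sketch of that encode and decode flow. One big caveat: to keep it runnable with nothing but NumPy, it does the linear algebra over ordinary floating-point numbers instead of a Galois field, so it only illustrates the matrix mechanics, not how a real Reed-Solomon implementation does its arithmetic.

```python
import numpy as np

k, m = 5, 3
n = k + m

# Vandermonde-style generator matrix: row i evaluates the data polynomial at
# the point x = i. (Real codes do this over GF(2^w); reals are for show only.)
V = np.vander(np.arange(n, dtype=float), k, increasing=True)   # n x k

# Make it systematic: multiply by the inverse of the top k x k block, so the
# first k rows become the identity and the data appears in the clear.
G = V @ np.linalg.inv(V[:k, :])

data = np.array([3.0, 1.0, 4.0, 1.0, 5.0])       # the k data symbols
codeword = G @ data                               # n encoded symbols
assert np.allclose(codeword[:k], data)            # systematic: data in the clear

# Lose any three symbols, say the ones at indexes 1, 4 and 6...
lost = [1, 4, 6]
survivors = [i for i in range(n) if i not in lost]

# ...keep the matching k rows, invert that k x k submatrix, and multiply.
G_sub = G[survivors, :]                           # 5 x 5 and invertible
recovered = np.linalg.inv(G_sub) @ codeword[survivors]
assert np.allclose(recovered, data)
print(np.round(recovered, 6))
```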
One of the disadvantages of Reed-Solomon is that it requires k available things to reconstruct any element. So in the example I just gave, every time you reconstruct even a single lost thing, you need k things; in that example, you'd need five. Because of this, and I think people learned it over time, in a large scale system you end up having to ship a lot of stuff over the network when doing reconstructions with codes like Reed-Solomon. So, I don't know, maybe like 10 years ago or so, maybe even more recently than that, the coding theory community really grabbed onto this whole minimize-repair-cost thing, and I'm not sure if you guys have heard of them, but there's things called regenerating codes; locally repairable codes, I think those were out of Microsoft; and flat XOR codes, which is some of the work I did when I was at HP Labs. And basically what these do is trade space efficiency for more efficient reconstruction. So where Reed-Solomon is actually space optimal, with some of these codes that minimize reconstruction cost you pay a little bit more upfront, you might add a few more parities, so that you actually need to read off fewer disks when you get disk failures. So I'm gonna walk through a quick example here. Did someone have a question? No? I just heard something. I'm severely jet lagged, so I'm probably hallucinating right now anyway. Okay, so let's go through a simple example of a code that does local repair really well. So in the case of this code, same number of data, but not the same number of parity: we have four parity in this case, so we've added a little bit more. And these are the parity equations. So P zero equals the XOR of data zero, one and three. Everyone kind of understands that, right? Yeah, hopefully. So if I lose D zero, I don't have to read off k things. In this case, I guess k was six; I actually only have to read off three. And you could also do other interesting things. I have two equations I can use here, and I can do really interesting parallel reconstructions as well. And in this case, I end up paying a little bit more in space to actually save on reconstruction; there's a little sketch of that XOR repair below. These codes tend to make a lot more sense in an archival use case, where you don't have a lot of reads coming in and your common workload is basically just repairing stuff. Does that make sense? All right. Okay, so a quick advertisement here, because we're running low on time. So like I was saying before, I kind of started working with Swift more at the library level, and it was to create Python bindings for some of the lower level C libraries I was working on. And basically, what came of that was what we call PyECLib. And the goal of PyECLib was basically to just make a pluggable, easy to use EC library for Python. And the primary use case is Swift. There's a few other people I know that are using it for stuff, but Swift is the main user of it. So originally we had all the logic in PyECLib; we did everything in the Python C API, and that turned out to just be a monolithic terror. So we actually pulled all the smarts out into a C library called liberasurecode, which I'll actually talk about on the next slide. But the basic thing with liberasurecode was separation of concerns: the Python piece does the language-specific translation from Python into C, and then liberasurecode actually does all the smart stuff. So it does all the pluggable stuff with the back end libraries, plus some value-added stuff like doing checksums and other helper functions.
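Coming back to that local-repair example for a second, here is a tiny sketch of it in code. The first parity equation, P0 = D0 XOR D1 XOR D3, is the one from the slide; the second equation and the rest of the layout are invented just so the example runs, so don't read this as any particular published flat XOR construction.

```python
# Illustrative only: one parity equation from the slide plus a made-up second one.
import os

def xor(*chunks):
    # Bytewise XOR of equally sized chunks.
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

# Six data chunks, as in the k = 6 example.
D = [os.urandom(16) for _ in range(6)]

# P0 = D0 ^ D1 ^ D3 is from the talk; P1 is a hypothetical second group that
# also covers D0, so there are two ways to repair it.
P0 = xor(D[0], D[1], D[3])
P1 = xor(D[0], D[2], D[5])

# Lose D0. Instead of reading any k = 6 chunks, Reed-Solomon style, we can
# repair it from just three: the other members of either parity group.
repaired_via_P0 = xor(P0, D[1], D[3])
repaired_via_P1 = xor(P1, D[2], D[5])
assert repaired_via_P0 == D[0] == repaired_via_P1
```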
So yeah, it will be a build dependency. Yeah. Oh, so the question is whether liberasurecode is a build dependency, and it is. So if you install PyECLib, you need liberasurecode. Is it a separate library? Completely, yeah, yeah, it's a dynamic library. Completely separate. Yeah, my favorite part about PyECLib is that I can treat it like a black box, and so can the rest of us working on the Swift side. Thanks, Kevin. Okay, so let's wrap up and see how this ties back into Swift. If you can click for me there, Kevin. Okay, so first, a couple of the design considerations. Some of these are probably pretty obvious. Go ahead and click twice so we get the whole thing up here, and then we'll make sure we finish. Yeah, right there is good. So I said pretty obvious, good software engineering practice here. The first one actually was a design decision that we spent a good amount of time discussing last year with really all the members of the community that are participating. And that was, kind of, what do we do: the erasure code work inline at the proxy as the data comes flowing through, or do we let the data flow through and then do it as a background activity on the storage node? So there was a lot of work involved in looking at the pros and cons, and we actually have a prototype of both styles available. And we have gone with the inline version, where we do the erasure code work at the proxies. Actually, back up one. Oh, sorry. A couple of other quick ones are probably obvious. Building on storage policies: the whole reason we sort of got diverted from EC for a while was to do this storage policy work so that we could easily extend Swift with EC. And then here's really the common sense one: keep it simple and leverage the current architecture. And a good example is when I went over the ring stuff. Like I said, we haven't changed any of the ring code, right? And yet we're doing something completely different with the data as it streams through the system. So that's pretty cool. Okay, now let's flip to the architecture slide here. And this is now with EC. So you can see the sort of pinkish colored stuff is really the main areas that we're touching to get the EC stuff to work. So up at the proxy, I've really bucketized all of the controllers into one box, right? So all of the controller classes are touched by EC, mainly the object side, because that's where the data's actually flowing through. But there's a lot of work going on there, a lot of changes. And again, if anybody's interested in the details, Friday is our design session, where we can dive into whatever people want to dive into. And then of course, PyECLib running up on the proxies to do the encoding and decoding as the data streams in and out. Okay, now down on the storage nodes at the object server: lots of work at the object server. The three main areas are ssync, which is an alternate method of doing synchronization, right? So the original one is rsync, and the ability to do ssync is, I think, what we're calling production as of Icehouse, right? So EC is leveraging a lot of that code, right? It's not doing replication, but it's leveraging a lot of that code and a lot of that logic to facilitate communications between the storage nodes as it decides what to reconstruct and how to reconstruct it. Then we've got this big monster down here called the EC reconstructor, which will undoubtedly be a topic of discussion on Friday.
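At the library level, the fragment-rebuild step the reconstructor relies on looks roughly like the call below. This is a sketch only, assuming a PyECLib-style driver like the one shown earlier; all of the surrounding work, figuring out what's missing, the ssync conversations between nodes, and deciding where the rebuilt fragment should live, is the Swift-side machinery being described here.

```python
# Illustrative only: the library-level rebuild a reconstructor-style daemon
# might perform, assuming a PyECLib driver like the one sketched earlier.
from pyeclib.ec_iface import ECDriver

driver = ECDriver(k=5, m=3, ec_type='jerasure_rs_vand')
data = b'fragment archive contents' * 500
fragments = driver.encode(data)

# Pretend the fragment at index 2 lived on a disk that failed.
missing_index = 2
available = [frag for i, frag in enumerate(fragments) if i != missing_index]

# Rebuild just that one fragment from k of the surviving ones; a daemon would
# then ship it to the fragment's new home rather than re-encoding the object.
rebuilt = driver.reconstruct(available[:5], [missing_index])[0]

# Sanity check: the rebuilt fragment decodes just like the lost one did.
assert driver.decode(available[:4] + [rebuilt]) == data
```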
The reconstructor is kind of a beast, so there's a lot of interesting stuff going on there. And then PyECLib with its plugins down here on the storage nodes. So I'm hoping somebody has a question about that piece right there. It's like, hey, Paul, didn't you just say that it was all done in the proxies? Is that the question? All right, good question. Fantastic, you get a free t-shirt. So it is true, all of our EC work for incoming and outgoing data is done at the proxy, but when the reconstructor needs to rebuild an object, that's done at the storage nodes, right? So anytime there's a need to do a reconstruction, the proxy is not involved. So we actually are doing EC work on all of the nodes in the cluster, taking advantage of all the CPU horsepower that we've got. And then down here, we've got some changes around the metadata, especially with EC, and there's a lot of interesting details there too. I would encourage you to come to our design session Friday and we can get into all the nitty gritty, but there is a lot of devil in the details in the object node implementations. What do you mean by that? Yeah, so Clay's question was, what did I mean by the metadata, isn't it just the storage policy index? That's gotta be a setup, because I know Clay knows. But an example of some of the metadata changes sort of goes to the concept that Swift is built around storing complete objects, right? So today, when you store an object, all of the metadata that is associated with that object sort of travels with it and makes it into the container database all in one, right? Every copy is the complete object. When we talk about an erasure code, that object comes into the proxy and now it's busted up into a whole bunch of chunks, right, like what Kevin was showing. Now we've got metadata associated with the chunks as well as metadata associated with the original object. And we've gotta figure out how to store all that stuff efficiently and make sure it's properly replicated, for lack of a better term, so that it's properly represented throughout the cluster, and that the container update that goes with that object reflects the object and not one of the fragments of the object. So a lot of little nitty gritty details and tracking of who goes with what and where everything lies and all that kind of stuff. That's what I meant by the additional metadata changes and complexities that maybe you wouldn't think about. You're just thinking, oh yeah, it's just container metadata. But yeah, a lot of it is in the object controller as well, per Clay's comment. This slide is a couple of months old; we've done work since this. Okay, so let's see, next slide. And then this is sort of the final putting-it-all-together. You know, what does it sort of look like? Hopefully you can imagine what this animation is gonna look like. Go ahead and click once there, Kevin. So you can see, same as before, an object comes in, hits the proxy, and now this gets fancy, ready? Oh, check that out. So we've encoded the data, and now we spread the fragments out throughout the cluster. Again, based on the rules that were set up when the ring was built, with each ring location now meaning truly a location and not a copy of that object. So now we've got fragments scattered all over the land here, and the application comes in to do a GET. Now you can see we're going to go get a subset of the fragments.
Now, we've got different ways we can actually do this, and we haven't bottomed out on the default implementation, but what I'm trying to show here is that we don't need to get all n fragments. We can get fewer than that and still recover the original data. There's trade-offs: whether we want to go out and get all n and use a subset depending on which come in fastest, or whether we want to save on network bandwidth and just get the minimum number, and if we can't get them all, wait and go get another one. So again, more details there, but for the purposes of the example we only pulled back three, and then the decoder, go ahead, wait for it. Oh yeah. And now we've decoded the data, right? So we're going to reconstruct it, and a fine point here is that this is systematic encoding, so if we've got the data chunks, we don't necessarily actually have to decode. We'll still feed those data chunks into PyECLib, but it doesn't have to do the math. It'll basically just concatenate the chunks for us. So if we're smart about picking which subset of fragments we get on the GET, then we minimize the overhead of reconstruction, because it's basically a concatenation. Okay, now we've got our for-more-information links, a lot of good stuff here. We do all of our day to day tracking of who's doing what and what the status of each individual task is on Trello. It's just a pretty easy to use discussion board. So there's the link. It's generally within a few days of being up to date. So if anybody wants to check it out or participate, that's the link. We do have blueprints, but again, most of our work is tracked and done on Trello, and then there's the code itself. We are off on a development branch, so we're not in master. We do keep in sync with master, but we do all of our work off to the side because it's big and hairy and complicated. There's the link to PyECLib and the link to liberasurecode. So with that, I think we're about out of time, but if anybody wants to come up and see Kevin or me for questions afterwards, feel free. I see a hand, but I don't know, does somebody have a watch? Are we out? It's 3:08. 3:08? Yeah. Okay, so we have time for a short question. Go ahead. How do you scale up or scale down the Swift ring? Yeah. What was it you asked? Yeah, yeah. When you initially create your ring, there's some guidance provided in the existing documentation that sort of has you thinking about growth in the future. So you add or remove devices. There's no difference between triple replication and erasure code in that respect. You still have to rebalance. Yeah, you still have to rebalance, and the reconstructor in the EC software architecture will take care of the movement of the data after the ring has been rebalanced and distributed to the storage nodes. So it's very analogous to how replication works, except instead of making copies, we're either just moving stuff around or we're reconstructing and moving stuff around. Good question, thanks. A couple more? So, is this microphone working? Is it on? Yep. Great believer in microphones. This is Alex McDonald from NetApp here. Just a quick question about erasure codes and the math, sorry. But this is actually really simple. Erasure codes assume that when you're reconstructing the data, especially from a minimal set, the transmission and the storage of the data is perfect.
One of the issues with erasure coding, as I see it here, is that we're dealing with imperfect devices. Although network transmission is pretty reliable, device storage is notoriously unreliable, and reconstructing, for instance, a piece of data from a five-and-three scheme out of five pieces of data, you run the risk of actually reconstructing crap, to be quite honest. Checksums. And how do you update the checksums at the node level? Yeah, so the client computes a checksum, and we've got checksums all over the place. Through the entire path we compute checksums: from the client, in the proxy, stored in metadata per fragment as well as per object, transmitted over the wire to the storage nodes where it's checked again and stored in the extended attributes of the file system. And when each fragment archive is read, it's checked against its stored checksum before it's sent back to the proxy to be reconstructed, in which case we have the final checksum on the original object to make sure that matches. Where's the checksum kept? In the header. In the header of the fragment. In the same blocks? Plus we have a prerequisite that only perfect devices be used. The same blocks or separate blocks from the header? We, I mean, it's the same file, right? Okay, thanks. Yeah, yeah. So if somebody didn't hear Clay's comment, he just said that, yeah, it's the same checksum scheme that's used today with replication, but we did have to add a few steps here and there because of the fragmentation of the object. Thanks, good question. Hi, Stephen Lang. So Swift is eventually consistent. How does that change with erasure coding? It does not. So does it still just write out what it can, and then does it sync the rest of the erasure code and compute it afterwards? Isn't the current policy k plus one, and then that's tunable as well? Yeah, so, you know, I think that's kind of a common misconception about what eventually consistent means. It does not mean that we store some of it and then tell you it's done and then store the rest later. That's maybe kind of what happens in some scenarios, but really what it means is that the synchronization of your metadata and your data isn't guaranteed at every point in time. Eventually they will be consistent, but just like with triple replication, we're not gonna tell the client that a PUT has worked until we get a successful number of puts that we define as a quorum. So until we know that the data is durable, we don't tell the client, go ahead and be on your merry way. So all of the math and all of the putting does happen before your client gets an accept. So it's still not strongly consistent, but again, the terminology is sometimes used a little fuzzily. All right, thanks. Did that answer the question? Yes. Okay, thank you. Okay, thanks everybody.