Hello everyone. My name is Alexey Stuknikov. I'm a technical support engineer at Red Hat, where I help our customers with their OpenStack and Ceph deployments. Two years ago I worked as a software maintenance engineer at Mirantis. I backported patches to the stable branches of the Mirantis OpenStack product, and at some point I faced an issue with Swift that was related to erasure codes. Like every infrastructure engineer, I was aware of the different RAID levels and of how to use those RAID levels to solve different tasks. But the erasure code problems that I faced were quite difficult for me. I tried to use the web to find relevant information. I found a number of documents provided by storage vendors that were either ambiguous or described how to configure erasure codes for that particular storage, but didn't say much about erasure codes themselves. Then I tried to find scientific articles that describe how erasure codes work in a nutshell, and again I failed. Some articles were ambiguous and didn't contain the full picture; others were so technical that it took me a few days to understand what's going on under the hood. At that point I decided to create a presentation, or maybe write a paper, that would help people in that situation quickly overcome this limitation, get up to speed with the technology, and proceed with their tasks.

So today I'm going to say a few words about modern storage systems. I will try to explain why we need erasure codes, provide a high-level overview of the methods behind them, and give further references for those who want to know more. I will also explain how erasure codes are implemented in Ceph storage, just to show how the whole infrastructure solution works.

Currently there are numerous classifications for storage systems. A common one states that there are block storages, file system storages and object storages. Basically, for block storage the client uses iSCSI, Fibre Channel or some other protocol to exchange block write and read requests with a block device that is hosted on a remote storage system. From the operating system's point of view, block storage is just another block device that is accessed in pretty much the same fashion as any local hard drive, and the block commands exchanged with the remote storage system are pretty much the same as the commands exchanged with local hard drives. For file system storage, clients generally use some special operating system driver or software to mount a remote file system locally. From the client's point of view, they don't need to worry about the underlying hardware; they just exchange file system requests with the remote storage. Object storage is the latest invention. The client uses an API or some storage-specific protocol to exchange requests with the remote storage. From the client's point of view, they should only care about the object's name and its location on the remote storage; they don't need to worry about file system abstractions, block commands and other things like that.

Ten years ago, when I started my IT journey, the only enterprise-grade storage available to companies was a huge proprietary hardware storage shipped by different vendors. They used proprietary closed-source software executed on top of proprietary hardware to achieve different goals.
Generally speaking, that was quite reliable, properly documented hardware, but it was very expensive, not very flexible, and it still had scale-out issues. To overcome these limitations, the software-defined storage concept was introduced. Basically, software-defined storage systems allow you to manage the data independently of the underlying hardware. It means that you could build your storage system using mid-level or even low-end servers, which will be quite cheap, and depending on your storage system it could be very flexible. So pretty much everyone could build a petabyte-scale storage in their basement. Software-defined storages are generally limited to file system storages and object storages. Since there are generally no hardware limitations, software-defined storages can be very flexible. This means that the operator can define storage policies in a very detailed and deep fashion. For example, they could set the way the data is distributed across the storage, select where exactly to put certain types of data, or set comprehensive replication or other data high-availability policies. Modern infrastructure engineers can face knowledge issues when they deal with this kind of storage, because it requires another level of awareness about what's going on under the hood.

Besides providing access to the data, storage systems should also care about data reliability and its high availability. Generally speaking, there are two main ways to store the data reliably: replication and erasure codes. So let me explain why we generally need three replicas and why this kind of replication is considered safe for modern software-defined storages. Let's imagine a software-defined storage that is deployed on storage nodes that are not very reliable. At the same time, those storage nodes are hosted in a good data center, so you shouldn't care about network outages, power outages or any other such issues; basically you should care only about node failures. Let's say that you have calculated an estimate for your nodes' uptime, and you know that each node will be up 99.9% of the time. If you do not replicate your data, then a rough estimate tells you that some data will be lost within a year; after a node fails and its data is lost, you will not be able to serve client requests. If you use two replicas, it means that you keep two instances of the data in your storage, and if your storage system properly recovers that data, then there is a high probability that you will not lose your data within some reasonable amount of time. Storing three replicas is generally considered the solid and reliable way to store the data.

So if replication can solve this issue, why do we need erasure codes? Well, the problem is that replication is a simple and reliable way to achieve data reliability, but if you want to store some cold data, like full database backups, video files from recent conferences, images or stuff like that, then you probably wouldn't want to pay the money for proper replication. That's the use case for erasure codes. Basically, the concept behind erasure codes is simple. The original data is separated into segments and encoded. After the data is encoded, you get a number of additional segments that contain special checksums. Those segments are independent from each other; each of them can be calculated independently from the original data.
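To make the replication argument above concrete, here is a minimal sketch of the arithmetic. It is my illustration, not part of the talk: it assumes node failures are independent and that data is unavailable only when every node holding a replica is down at once, which is a toy model rather than a real reliability calculation.

```python
# Toy estimate: probability that all R replicas of a piece of data
# are unavailable at the same time, assuming independent node failures.
# Illustrative model only, not a real reliability analysis.

NODE_DOWNTIME = 0.001  # node is down 0.1% of the time (99.9% uptime)

def loss_probability(replicas: int) -> float:
    """Probability that every replica is down simultaneously."""
    return NODE_DOWNTIME ** replicas

for r in (1, 2, 3):
    print(f"{r} replica(s): ~{loss_probability(r):.0e} chance of unavailability")
# 1 replica(s): ~1e-03
# 2 replica(s): ~1e-06
# 3 replica(s): ~1e-09
```

Under this toy model, each extra replica buys roughly three orders of magnitude, which is why three replicas are treated as the safe default.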
The math behind erasure codes allows you to lose a number of segments of the encoded message and still restore the original data without any problem. So let me provide a brief overview of erasure code terminology. Basically, the operator or the erasure code driver developer defines a number of values. The first of them is K: the number of data segments in the encoded message. In this case K is equal to 3; there are 3 data segments in the encoded message. There could be a number of additional parameters, but the simplest case requires only one more: M, the number of parity segments in the encoded message. In this example M is also 3. Generally speaking, the operator or developer could set a number of additional parameters, like encoding schemes, numbers of additional parity segments and stuff like that, but we will generally work with those two. The process of transforming the original message into an encoded one is named encoding. The process of transforming the encoded message, with or without some segments, back into the original one is called decoding.

The math behind erasure codes could be used in a way that detects errors, but generally speaking modern storage systems don't use erasure codes that way. That's why the failure model for erasure codes is an erasure. It means that an erasure code driver will not check for data consistency; it will simply calculate the original message using the existing segments and will not repair any errors inside the decoded message. So there is no difference whether some bits of the G1 segment are lost or the whole segment is lost.

Let's check the math behind the process with the simplest example, where we have two data segments, two parity segments, and L is the number of words in each segment. The thing is that erasure codes don't deal with whole data segments; they deal with small subsets of a data segment named words. Generally the word size is some number of bytes. Here is a simple example. Let's say that we have original data that contains two segments, G1 and G2, with L words in each of those segments, and we want to calculate two parity segments. The math behind erasure codes states that there is a function F that allows you to encode the original message in a way that you get a message containing two data segments and two parity segments. At the same time, each parity segment can be calculated with its own function: the first parity segment is calculated from both data segments using a function F1, and the second parity segment likewise using a function F2. Those parity segments are independent, and basically that's the transformation that we will use on the word level.

One step further: generally, different kinds of matrices are used to transform the original message. The whole matrix that is used to transform the original message is named the distribution matrix. Here it is. The distribution matrix generally contains two submatrices. The first one is the identity matrix; identity matrix is the general name for this kind of matrix that contains ones on the diagonal cells. The coding matrix is a special matrix that is calculated using a special algorithm, like the Vandermonde or Cauchy constructions. Every element of the coding matrix is calculated in a predictable fashion, so that the coding matrix has an inverse matrix that will be used for data restoration. And because erasure codes are used to encode and decode words, not real numbers, we need to define the operations over a Galois field.
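As a sketch of what the slide's distribution matrix looks like for the two-data, two-parity case: for each pair of corresponding words g1 and g2 from segments G1 and G2, the encoding is one matrix-vector product. The coding matrix entries a_{i,j} below are symbolic placeholders; in a real driver they come from a Vandermonde or Cauchy construction, and all arithmetic is over a Galois field.

```latex
\underbrace{\begin{pmatrix}
1 & 0 \\
0 & 1 \\
a_{1,1} & a_{1,2} \\
a_{2,1} & a_{2,2}
\end{pmatrix}}_{\text{distribution matrix}}
\begin{pmatrix} g_1 \\ g_2 \end{pmatrix}
=
\begin{pmatrix} g_1 \\ g_2 \\ p_1 \\ p_2 \end{pmatrix},
\qquad
\begin{aligned}
p_1 &= F_1(g_1, g_2) = a_{1,1}\,g_1 + a_{1,2}\,g_2,\\
p_2 &= F_2(g_1, g_2) = a_{2,1}\,g_1 + a_{2,2}\,g_2.
\end{aligned}
```

The identity submatrix on top simply copies the data words through, which is why the encoded message still contains the original data verbatim.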
So, generally speaking, we just multiply the distribution matrix by the original message, which looks like a matrix with one column, and we get the encoded message, which contains the original message plus the parity checks. And the final step here is a practical example for the first three words of the data segments, the distribution matrix and the result. So let's name the distribution matrix A, the original message D and the encoded message E. If we lose some segments of the encoded message, we can just remove the corresponding rows from the equation and get modified A, D and E matrices that still form a valid equation. Here is this equation in simplified form. Then we can compute an inverse matrix and multiply both sides of this equation by it. With this multiplication we can simply cancel this term, and we get this result. Basically it means that we can compute the original data using the inverse of the modified distribution matrix and the modified encoded message. So the concept behind erasure codes is that simple, or that complex, I don't know. Basically, every erasure coding driver will first restore the data segments and then recompute the parity checks, if it needs to recompute data.

I tried to include more detailed examples of the math and the code behind it, but because all computations should be performed over Galois fields, and because that generally requires many additional libraries, the code and math weren't very readable. So I would like to provide these two references for those who want to learn more about practical math and coding examples. James S. Plank is a scientist and developer who wrote the jerasure software; this software is in fact the default choice for most storage systems today. The first paper is useful for practical math examples and explanations of how Galois fields and matrices are used to encode and decode data. The second paper contains practical software examples written in C that will be useful for those who want to build their own prototype of erasure codes.

So what does it mean for real-life storage systems? As I have said, storing three replicas is considered safe. It means that you can lose some number of nodes and, if you properly restore the data, you will still be able to serve customers. Production erasure codes are designed in a way that you can lose even more nodes and still be able to restore the original data. For example, there is a public object storage provider named Backblaze. They have their own blog where they write about their infrastructure details. They claim that they use erasure codes to reliably store the data in their production cluster: for every original data message they store 17 data chunks and 3 parity chunks. Basically it means that they can lose three nodes and still serve customer requests. At the same time, the size of the encoded data is only slightly bigger than the original one. Red Hat also provides erasure coding software, and these are the K and M values supported for its erasure plugin. Strictly speaking, in the worst case you will get the same level of data reliability while using only a slightly bigger amount of storage.

At the same time, there are a number of issues with using erasure codes in production. Since data segments are distributed across your storage, you will get additional network load and delays when you try to rebuild the data to serve customer requests, and your storage performance will probably degrade after some nodes fail.
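To make the A · D = E story concrete, here is a small self-contained sketch. It is my illustration, not the talk's own code: to keep it readable it does the linear algebra over ordinary rational numbers with Python's fractions module instead of a Galois field (exactly the complication the speaker mentions omitting), and the coding matrix rows are an arbitrary invertible choice rather than a real Vandermonde construction.

```python
# Erasure coding as linear algebra: K=2 data words, M=2 parity words.
# Illustrative only: real drivers work over a Galois field, not rationals.
from fractions import Fraction

# Distribution matrix A: identity on top, coding matrix below.
A = [[1, 0],
     [0, 1],
     [1, 1],   # first parity row
     [1, 2]]   # second parity row (chosen so any 2 rows stay invertible)

def matvec(M, v):
    return [sum(Fraction(m) * x for m, x in zip(row, v)) for row in M]

def invert(M):
    """Gauss-Jordan inversion of a small square matrix."""
    n = len(M)
    aug = [[Fraction(M[i][j]) for j in range(n)]
           + [Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [x / aug[col][col] for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

D = [42, 7]           # original message: two data words
E = matvec(A, D)      # encoded message E = A . D  ->  [42, 7, 49, 56]

# Erase two rows of E (the first data word and the first parity word).
surviving = [1, 3]    # indices of the rows that are still available
A_mod = [A[i] for i in surviving]
E_mod = [E[i] for i in surviving]

# Decode: D = inverse(A_mod) . E_mod
recovered = matvec(invert(A_mod), E_mod)
print(recovered)      # [Fraction(42, 1), Fraction(7, 1)] -- original data
```

The decode step is literally the slide's derivation: drop the erased rows from both sides of A · D = E, invert what remains of A, and multiply it back onto the surviving part of E.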
If some node fails, then you need to decode the original data to serve customer requests whenever some data chunks are lost, so obviously there is additional CPU load for encoding and decoding operations. It is also generally hard to implement partial writes for an erasure-coded software-defined storage, because you need to recompute the coding segments for the original data whenever you write a small update. For example, if a client adds a few strings to some text file, then in the best case you will need to recompute some words of the coding segments, and in the worst case you will need to recompute all coding segments. So partial writes are one of the challenges of erasure-coded storage.

As I have explained, an erasure code (EC) driver will only encode and decode the data. But there are a number of tasks that should be solved by the software-defined storage itself. For example: how will you enforce certain EC parameters in your cluster? Which nodes should perform the calculations on the data to serve customer requests? How should you restore lost data segments? How should you detect data corruption and storage failures? How should you implement additional abstractions like failure domains and other things? All those tasks should be solved by the software-defined storage. I will use Ceph storage as an example to show how these tasks were solved by the Ceph developers.

Ceph stores data in pools. Pools are abstraction layers on top of your storage. For every pool, you can define various parameters. For erasure-coded pools, you can select the proper driver. There are a number of drivers available in Ceph: jerasure; ISA, which is pretty much the same as jerasure but optimized for Intel processors; locally repairable code (LRC), which is used to optimize some calculations behind erasure codes; and SHEC, which is basically the same concept as locally repairable code, developed by another vendor. You can also set basic settings for every EC driver. You can set the number of data and parity segments. You can choose a failure domain; for example, you may want to distribute your data among separate nodes, among racks, or even among data centers. You can configure the coding technique, basically selecting the type of coding matrix. All of this can be done per pool.

Ceph will automatically select the nodes to store the data. It will use a cluster map and the object name to build an ordered list of storage nodes that should be used to store this data. The first node in this list will be used as the primary one. The primary node will be used to encode and decode the data, to distribute the data among the other storage nodes, and to rebuild the original data from segments to serve customer requests. As I mentioned before, erasure coding doesn't fix errors in the encoded data; that's why Ceph periodically checks whether the stored erasure code segments are consistent or not, and monitors the availability of the other storage nodes.

Here is a practical example. Let's say that we have an object named NYAN with the following content, and we have a Ceph storage pool with three data segments and two parity segments. We want to write this data to this pool. First, the Ceph client will find the proper mapping for this object: it will determine the nodes to write the data to, getting this kind of ordered list, and the first node in this list will be used as the primary one. So the client will send this object to the primary node. Then the primary node will use the jerasure plugin to encode the data, and as a result it will get a number of segments.
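As an illustration of the "ordered list from a cluster map and an object name" idea, here is a toy stand-in of my own, not Ceph's actual CRUSH algorithm: it just hashes the object name against each node to pick K+M distinct nodes deterministically, which is the core property CRUSH provides in a far more sophisticated, topology- and failure-domain-aware way.

```python
# Toy placement: deterministically map an object name to an ordered list
# of K+M distinct storage nodes. A simplified stand-in for Ceph's CRUSH,
# which uses a real cluster map and honors failure domains.
import hashlib

NODES = [f"osd.{i}" for i in range(10)]   # a pretend cluster map

def place(object_name: str, k: int, m: int) -> list[str]:
    # Rank every node by a hash of (object name, node); take the top K+M.
    ranked = sorted(
        NODES,
        key=lambda node: hashlib.sha256(
            f"{object_name}/{node}".encode()).hexdigest(),
    )
    return ranked[: k + m]                # first entry acts as the primary

print(place("NYAN", k=3, m=2))
# The same object name always yields the same ordered list of five OSDs.
```

The useful property is that any client holding the same node list and the same object name computes the same placement, with no central lookup table.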
And then it will distribute those segments among the other nodes: the first three data chunks will go to the first three storage nodes, and the two parity chunks will go to the last two storage nodes. I took this example from the Ceph documentation, and it's pretty accurate, but there is one issue with it. As I have mentioned before, erasure code plugins do not operate on whole data segments; they operate on words, and there are a number of optimizations to calculate parity chunks properly. One of those optimizations is data striping. The thing is that, by default, a segment of the encoded message should be four kilobytes in size. As a result, the original data will be padded by the primary OSD, and the content will be slightly different: the first data segment will contain the whole message plus a number of padding bits, and the last two data segments will be purely padding bits. A slight change, but it could affect those who want to restore some data manually.

So let's check how the data is restored. First, Ceph will use its internal algorithm to detect an OSD failure. It will wait for some interval and recompute the cluster map. Then it will collect the encoded chunks that are still available; in this example it will get a data chunk from the second storage node and a parity chunk from the fourth storage node. Then it will use those chunks to decode the original data. Then it will check the modified cluster map to get new nodes for the computed data and parity chunks, and transmit the data chunk to OSD6 and the parity chunk to OSD7. So this is a basic example, but it's very close to the things that happen in production environments.

Previously I mentioned that there are a number of erasure code drivers in Ceph storage, and that one of them is locally repairable code. What's the difference between locally repairable code and the jerasure plugin? Well, locally repairable code will compute additional parity chunks for subsets of the chunks in the encoded message. So let's say that the original message was encoded into four data chunks (these chunks) and two parity chunks (those chunks). Then, to simplify the restoration process, the locally repairable plugin will also compute additional parity chunks for subsets of segments in the encoded message; here, an additional parity chunk is computed for every three segments. As a result, when one of those chunks fails, the primary OSD will not request all six chunks from the other nodes. Instead, it will use the existing chunks in that group to recompute the other chunks of the same group.

So, references for those who want to learn more. Generally speaking, the first thing to check is the articles by James S. Plank. He is the person who developed the jerasure plugin, and his papers contain a very deep and good overview of how this plugin works. There are a number of documents for different storage systems that explain how to configure erasure codes; in my opinion, the best I've seen is the Ceph documentation. It contains a lot of practical examples, describes how the data is distributed and how it will be restored, and explains the various actions Ceph will take after certain conditions occur. It's also worth mentioning that pretty much everyone can deploy a Ceph cluster using virtual machines and check how the data is stored in their own environment, so it would be very useful for those who want to learn more about erasure codes using their notebook or office computer. I can also recommend the blog of the Backblaze company. It contains a lot of interesting articles about their infrastructure.
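To illustrate the striping and padding caveat above, here is a small sketch of mine (with the chunk size shrunk from Ceph's 4 KiB default to 8 bytes so the effect is visible) showing how a short object is split into K fixed-size data chunks, with the trailing chunks ending up as pure padding, roughly what the primary OSD does before the words go to the encoder.

```python
# Splitting a short object into K fixed-size data chunks with zero padding.
# Chunk size reduced from the 4 KiB default to 8 bytes for readability.
CHUNK_SIZE = 8
K = 3

def split_into_chunks(data: bytes) -> list[bytes]:
    padded = data.ljust(CHUNK_SIZE * K, b"\x00")  # pad up to K full chunks
    return [padded[i * CHUNK_SIZE:(i + 1) * CHUNK_SIZE] for i in range(K)]

for i, chunk in enumerate(split_into_chunks(b"NYAN!"), 1):
    print(f"data chunk {i}: {chunk!r}")
# data chunk 1: b'NYAN!\x00\x00\x00'                 <- message plus padding
# data chunk 2: b'\x00\x00\x00\x00\x00\x00\x00\x00'  <- pure padding
# data chunk 3: b'\x00\x00\x00\x00\x00\x00\x00\x00'  <- pure padding
```

This is why someone reassembling chunks by hand will find the stored data slightly different from the naive "split the message into K equal parts" picture.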
For example, they explain how they built their hardware nodes and how they overcame certain limitations or certain failures. It can be very useful for those who want to improve their knowledge of the infrastructure behind storage systems. So that's pretty much it. I can answer some short questions. Sorry if it was too fast and too dense.

[Audience question about choosing the K and M parameters] Well, first of all, it depends on the plugin you use, because different plugins treat those parameters differently; as a result, you may face certain performance issues if you don't select them properly. The other concern is data failures: those numbers basically define the way your storage will recover after failures. The number of parity segments is basically the number of nodes that you can lose. So in that example, the public storage provider decided to store a large number of data segments and a relatively small number of parity segments, providing proper data reliability while spending only a small fraction of their storage on it. So there are a number of concerns behind those parameters, but generally it's computation performance and the failure model that you have for your software-defined storage. Thank you, everyone.