This talk is about the history of error detection and correction and how that led to erasure coding. So thank you for coming. (And it's not working any more... one second... okay, there we go. Nice, cool.)

So, who am I? My name is Danny and I work for SoftIron in the UK, where we're working on building a Ceph appliance. I've mainly worked for open-source companies in the past, companies like LINBIT and Codethink, and I run a nonprofit in the UK called Open Source Events, where we do tech meetups and conferences, mainly around OpenStack and cloud native. And yeah, I like beer.

I mainly got interested in this topic a few years ago, after I saw a colleague called Jim MacArthur do a lightning talk on error correction, and I wondered how this relates to RAID and other storage technologies. I did some digging and figured out there are lots of ways to do this, so I thought I'd provide an overview of some of the ways that error correction and detection is done.

The main idea is to have a set of techniques so that when we're transmitting data, we can figure out whether what we've received on the other end is inconsistent, or, even better, correct the data rather than just dropping it. These concepts have been around since before modern computers were born, but they've come a long way, and even the older ideas are still around in some of our modern technology.

So why do we have errors in the first place? It could be human error, it could be network glitches, or it could be something more random, such as noise on a chip or cosmic rays interfering with the chip. All of these things can cause us to receive something different from what was sent.

So, by a show of hands, how many people have heard of a parity bit? Pretty much the whole room, cool. The parity bit is the earliest and most basic form of error detection in computers, and it's a very basic type of checksum. The idea is that you add a bit to the end of a set of bits, which lets you detect an error when you transmit the data. We do this by adding up the number of one bits we have: if it's an even parity bit, we make it true for an even number of one bits, and if it's an odd parity bit, we make it true for an odd number of one bits.

You can see here that in the first character we have two one bits to transmit, so because that's even, we add a one on the end and send it, and if we receive the same thing on the other side, we know that no error occurred in transmission. Same again with 'K': an even number of one bits, so we add a one on the end, that's the bit in bold. If we transmit it and an error creeps in, there's a different number of one bits, so we can do the calculation on the other side and see that we should actually have a zero. There's obviously been some sort of mistake, so we just drop that frame and ask whatever sent it to retransmit. Same again with the third example: there's another error there, and because that frame has an odd number of one bits, the last bit we transmit is a zero, so the mismatch gets caught on the other side.
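To make that concrete, here's a minimal sketch in Python. It follows the convention used on the slide, where the appended bit is 1 when the data word contains an even number of one bits; the more common "even parity" convention instead chooses the bit so that the total count of ones comes out even, but the detection logic is the same either way.

    def add_parity(bits):
        """Append a parity bit: 1 for an even count of one bits, else 0."""
        flag = 1 if sum(bits) % 2 == 0 else 0
        return bits + [flag]

    def check_parity(frame):
        """Recompute the parity on the receiving side and compare."""
        *data, received = frame
        expected = 1 if sum(data) % 2 == 0 else 0
        return received == expected

    frame = add_parity([0, 1, 0, 0, 1, 0, 1, 1])  # 'K' in ASCII: four one bits, so append a 1
    assert check_parity(frame)                    # clean transmission: accepted

    frame[2] ^= 1                                 # one bit flips in transit
    assert not check_parity(frame)                # single-bit error: detected, frame dropped

    frame[3] ^= 1                                 # a second bit flips
    assert check_parity(frame)                    # two errors cancel out and slip through

Note that the last assertion also demonstrates the scheme's flaw, which comes up next: two flipped bits go undetected.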
This concept goes back as far as the 50s: parity tracks appeared on mechanical tape drives in the early 1950s. The major flaw is that if there's more than one error within a byte, we're not going to detect it. It will only pick up one-bit errors; if multiple bits flip within the same segment, the check passes even though the data is wrong. It's still very useful, though, and we use it quite heavily in hardware applications: microprocessor caches, the PCI and SCSI bus standards, RS-232 serial, and lots of other hardware applications use it. So that's a very simple way to detect basic errors in transmission, but we can't correct them.

There's a decimal equivalent to this, and it's used in lots of different real-world applications, like social security cards, ID cards, and all major credit cards. This is one of my old credit or debit cards, and you'll see that the last digit is actually a check digit. So how does this work? I don't know if you've ever noticed that when you're putting your credit or debit card number into a web form, it might automatically know whether the number is valid or not, even if you're not connected to the Wi-Fi; it's all happening locally. That's due to the Luhn algorithm. Hans Peter Luhn was a German-American engineer who worked for IBM in the 40s and 50s, and he was the first person to come up with the idea of information buckets to speed up data retrieval and storage. The idea behind the algorithm he came up with is that you double every other digit, starting from the right, then sum all the digits, and the result should be a multiple of 10. If it's not, then you know something's wrong. So the last digit either makes the sum a multiple of 10, or it doesn't.
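Here's a small Python sketch of that check. The number used is a well-known Luhn test value, not a real card number.

    def luhn_valid(number: str) -> bool:
        total = 0
        # Walk the digits from the right, doubling every second one.
        for i, char in enumerate(reversed(number)):
            digit = int(char)
            if i % 2 == 1:
                digit *= 2
                if digit > 9:      # e.g. 8 doubles to 16, and 1 + 6 = 7
                    digit -= 9
            total += digit
        return total % 10 == 0     # the check digit makes the sum a multiple of 10

    print(luhn_valid("79927398713"))   # True: valid check digit
    print(luhn_valid("79927398710"))   # False: wrong check digit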
Cool. So then we get on to trying to do some correction, and one of the main pioneers here is a guy called Richard Hamming. He was a mathematician and professor who worked at Los Alamos on the Manhattan Project in 1945, on a team responsible for programming the IBM calculating machines that checked the scientists' formulas were correct. A few years later he went on to join Bell Labs, where he became known as one of the Young Turks, a group of people at Bell Labs who all made heavy contributions to computer science. They were all pretty well respected, and they weren't given the usual responsibilities and structure, you know, measured by grants and teaching and papers; they were kind of just left to get on with it. Three of them ended up winning Nobel Prizes, a few went on to lead Bell Labs, and all of them were highly respected scientists. One of them, John Tukey, was the first to come up with the term "bit", contracted from "binary digit".

While he was at Bell, Richard Hamming set one of his calculating machines to work on a problem over the weekend. He set it off on Friday, and when he came back on Monday he discovered that, due to an error, the whole calculation had died early on Saturday morning. So he'd have to restart the whole thing, wasting two days of work, or wait until the next weekend. This was quite common, because you had punch cards, and if a card was bent or didn't punch properly then, you know, you were going to get an error. So he then went on to perform probably the biggest yak shave in history, and decided to sack off all that physics nonsense and figure out how to get calculating machines to detect errors and automatically retransmit data identified as inconsistent.

He published a key paper in 1950 called "Error Detecting and Error Correcting Codes", and in it he described a few concepts that are still core to error detection and correction today. Number one is the Hamming distance, which just describes how many single-character substitutions it takes to turn one string into another. So for "FOSDEM 19" versus "FOSDEM 20", there are two characters different, so the distance is two, and obviously the same idea works at the binary level. This can also be represented geometrically, in what's called the Hamming cube.

He went on to define a system called Hamming codes, which is far more robust at catching errors than standard parity. What he developed was the concept of having three parity bits for every four data bits. That's quite expensive, but it means we can detect errors of up to two bits and also correct one-bit errors. So how does this work? Let's have a look. If we have four data bits here, d1 to d4, and we want to send them over some sort of transmission channel, we can arrange them at the intersections of three circles, so 1, 0, 1, 0, and then calculate the parity for each circle. For the first parity circle we have one one-bit, which is an odd number, so the parity bit will be zero. For the second circle we have two one-bits, which is even, so the parity will be one. For the third circle we again have one one-bit, which is odd, so that's a zero. So we have three parity bits; we add them to our data and send the whole thing across the channel.

Now, on the other side, we receive something different: d3 has turned into a zero instead of a one. So how do we detect this? In the circles we arrange on the receiving side, you'll see that d3 has switched to zero, and we recalculate, starting with the first parity circle. Nothing's changed there; it's still an odd number. But something has changed in the second circle: it's supposed to have an even number of one-bits and it doesn't, and the same goes for parity circle three. So we know the offending bit is at the intersection of the second and third circles, which is d3, and we can just flip that bit and correct the error. That's the basic way Hamming codes work.
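Here's a minimal Python sketch of that scheme, which is the classic Hamming(7,4) code. It uses the standard even-parity assignment (p1 covers d1, d2, d4; p2 covers d1, d3, d4; p3 covers d2, d3, d4), so the parity bits come out inverted compared with the odd-parity slide, but the correction logic is identical: the pattern of failing circles pinpoints the flipped bit.

    def encode(d):
        d1, d2, d3, d4 = d
        p1 = d1 ^ d2 ^ d4
        p2 = d1 ^ d3 ^ d4
        p3 = d2 ^ d3 ^ d4
        return [d1, d2, d3, d4, p1, p2, p3]

    def decode(codeword):
        d1, d2, d3, d4, p1, p2, p3 = codeword
        # Recompute each circle; a 1 means that circle's parity no longer holds.
        s1 = p1 ^ d1 ^ d2 ^ d4
        s2 = p2 ^ d1 ^ d3 ^ d4
        s3 = p3 ^ d2 ^ d3 ^ d4
        # Each bit sits in a unique combination of circles.
        position = {
            (1, 0, 0): 4,  # p1 itself
            (0, 1, 0): 5,  # p2 itself
            (0, 0, 1): 6,  # p3 itself
            (1, 1, 0): 0,  # d1: circles 1 and 2
            (1, 0, 1): 1,  # d2: circles 1 and 3
            (0, 1, 1): 2,  # d3: circles 2 and 3
            (1, 1, 1): 3,  # d4: all three circles
        }.get((s1, s2, s3))
        if position is not None:
            codeword[position] ^= 1  # flip the offending bit back
        return codeword[:4]

    sent = encode([1, 0, 1, 0])
    received = sent.copy()
    received[2] ^= 1                         # d3 flips in transit, as on the slide
    assert decode(received) == [1, 0, 1, 0]  # the error is found and corrected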
We use parity pretty heavily in RAID, and some of you will be familiar with the RAID levels. You probably haven't come across RAID 2, because nobody uses it, and that's because it's really expensive. But RAID 2 is Hamming codes, applied at the bit level: you can see that there are four data bits and three parity bits striped across these hard drives. Like I said, I don't think anyone uses this.

Then, I'm sure all of you are familiar with RAID 5. RAID 4 is similar to RAID 5; the only difference is that rather than distributing the parity across the drives, we put all the parity on one dedicated drive. With RAID 4, writes are slower, because all the parity has to be written to the same disk, but random reads can be better, because you have a dedicated parity disk and fewer disks to search through.

So that's RAID 4; this is RAID 5, which I guess you're all familiar with. The way it works is that it runs an exclusive OR on the blocks: for every two blocks we write, we run an exclusive OR and write the result as a third block, and the green here is the parity data. If I keep doing that, on different drives each time, I end up with a whole bunch of data. Now if I lose one of the drives, I can recalculate it, and it's really simple: just run the exclusive OR again on the other two data blocks, and I can regenerate that drive. That's how RAID 5 works.
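A quick Python sketch of that XOR trick; each block here stands in for a stripe on a disk.

    def xor_blocks(a: bytes, b: bytes) -> bytes:
        """XOR two equal-length blocks byte by byte."""
        return bytes(x ^ y for x, y in zip(a, b))

    block_a = b"hello wo"
    block_b = b"rld fosd"
    parity  = xor_blocks(block_a, block_b)   # written to the third drive

    # The drive holding block_a dies: rebuild it from the two survivors.
    assert xor_blocks(parity, block_b) == block_a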
So that's Hamming codes, and that was in the 50s. Then these guys came along in the 60s, and they changed everything: Gustave Solomon and Irving Reed, who were staff members at the MIT Lincoln Laboratory. Ten years after Hamming's paper, they published a paper called "Polynomial Codes over Certain Finite Fields". I was reading up on this and I suddenly realized that it was getting really mathematically complicated. I'm not a mathematician; I was out of my depth. So, as I always do when I'm out of my depth, I went onto IRC and asked a question. I went looking for the right IRC channel to ask in, thinking that this would be helpful; I knew there were a bunch of mathematicians on this channel. So I just asked, you know, could someone explain polynomial interpolation to me? This is the response I got. I don't know, maybe some of you understand it; I have no idea. Yeah, I need to stop using IRC as a resource.

The main idea behind Reed-Solomon does, however, revolve around this concept of a Galois field. It's named after Évariste Galois, who was a French mathematician and, unlike our previous heroes, had a much more troubled and interesting life, in the sense that he got interested in maths at the age of 14 and had published a number of papers by the age of 17, one of which was about Galois fields and Galois theory. This was a year after he was rejected from university because his examiner couldn't follow his train of thought and was very confused. He was politically active and ended up in prison a number of times, before engaging in a duel to the death with an officer at the age of 20 and dying. So not the most glamorous of stories.

But yes, he defined this concept called a Galois field, which is basically a mathematical field: a set of values on which you can conduct mathematical operations. A finite field, or Galois field, is one that is finite, and its size is always a prime number or a prime power. These fields wrap around, so you can conduct any operation on the values within the field and always get another value in the field; you just take the result modulo the size of the field. A field can contain numbers, it can contain polynomials, and it can contain roots of polynomials.

And how does this relate to erasure coding and error correction? Well, the idea is that, using a Galois field, you can essentially plot any sequence of data as points on a polynomial graph, because the graph itself is representative of the Galois field you're using. Once you've done that, you can find other points on the same graph, which you can use as parity data. The purple points here are other points we found on the same graph, and they're what we use as our code shards. The idea is that you define a code and use it on both sides to recover data. It's highly efficient in storage terms, but much more computationally expensive.

So here's an example. K is the number of data shards we have, and M is the number of code shards. On the left we have what's called a distribution or generator matrix, which is made up of an identity matrix at the top (which is essentially equivalent to multiplying by one) and then, below it, the coding rows that we generate. We multiply that matrix by our data, and we end up with a set of data blocks and a set of parity blocks. Now let's say we lose some of our data shards in K. We can replace those rows with C1 and C2, the code shards we calculated earlier, multiply by the inverse of the corresponding generator matrix, and that gives us our data blocks back. And obviously that's far more efficient than using something like Hamming codes.
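Here's a toy version of that in Python, with K=2 data shards and M=2 code shards over the small field GF(7). This is just to show the generator-matrix mechanics with numbers you can check by hand; real implementations such as Ceph's plugins work over GF(2^8) with table-driven arithmetic.

    P = 7  # field size: all arithmetic wraps around modulo this prime

    # Generator matrix: identity on top (the data passes through unchanged),
    # Vandermonde-style coding rows below (evaluations at x = 1 and x = 2).
    G = [
        [1, 0],   # shard 0 = d0
        [0, 1],   # shard 1 = d1
        [1, 1],   # shard 2 = d0 + d1
        [1, 2],   # shard 3 = d0 + 2*d1
    ]

    def encode(data):
        return [sum(g * d for g, d in zip(row, data)) % P for row in G]

    def recover(survivors):
        """Recover [d0, d1] from any two surviving (shard_index, value) pairs."""
        (i, y0), (j, y1) = survivors
        (a, b), (c, d) = G[i], G[j]
        det_inv = pow((a * d - b * c) % P, P - 2, P)  # modular inverse via Fermat
        # Multiply the survivors by the inverse of the 2x2 sub-matrix, mod P.
        d0 = ((d * y0 - b * y1) * det_inv) % P
        d1 = ((-c * y0 + a * y1) * det_inv) % P
        return [d0, d1]

    shards = encode([3, 5])                  # -> [3, 5, 1, 6]
    # Lose both data shards; rebuild them from the two code shards alone.
    assert recover([(2, shards[2]), (3, shards[3])]) == [3, 5]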
This is really, really popular. Reed-Solomon is one of the most popular error correction codes we use today, and it's seen everywhere, including CDs and DVDs. It's used in 2D barcodes as well: any 2D barcode that's scratched or damaged can still scan, because it's using Reed-Solomon to recover from that damage, and it's also used for generating QR codes. Most implementations of RAID 6 use Reed-Solomon too.

This image is taken from the erasure coding docs in Ceph, and it's specifically for K=3 and M=2, so two parity shards for every three data shards. In Ceph, when you create a new pool, you specify whether you want it to be erasure coded or not, and also which erasure coding algorithm you want to use. Ceph has a pluggable architecture for erasure codes, so you can specify which code you want to use on a pool-by-pool basis. Once you've chosen an erasure code, you can't then change it for that pool, because you'd obviously have to recalculate everything; the only way to get around that is to create another pool and migrate. So if you're going to use an erasure-coded pool in production, you want to be sure that you've chosen the erasure code that best suits your workload.

Almost all of the erasure coding plugins within Ceph rely on some form of Reed-Solomon, though there are a number of different varieties with different strengths and weaknesses. The main plugin used in Ceph is called jerasure, and it has both Vandermonde and Cauchy versions; those are both versions of Reed-Solomon, and they differ just in how you generate the matrix at the beginning. There are also some more recent propositions. There's one called SHEC, the shingled erasure code, which aims to be more configurable and more efficient. There's an implementation called Clay, which focuses on reducing network bandwidth. And there are locally repairable codes, which add another variable called L, so you can say you want your parity blocks to be stored in a specific location, for example a particular geographic site, which helps us recover from failures without straining the network too much if we have a geo-replicated, multi-site Ceph cluster.

So the key things to remember are: choose the right numbers for K and M, depending on how much storage you want to use and how much that's going to cost you, and pick the right erasure coding plugin once you've done some testing with your workload, because, as I said, it's a fairly final choice.
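For reference, creating an erasure-coded pool in Ceph looks roughly like this (the profile and pool names here are made up, and the exact syntax can vary between Ceph releases, so check the docs for your version):

    # Define a profile: 3 data shards, 2 coding shards,
    # using the jerasure plugin's Vandermonde Reed-Solomon technique.
    ceph osd erasure-code-profile set myprofile \
        k=3 m=2 plugin=jerasure technique=reed_sol_van

    # Create a pool that uses the profile. As discussed above, the
    # profile is effectively fixed for the lifetime of the pool.
    ceph osd pool create ecpool 128 128 erasure myprofile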
So how does this relate to our work at SoftIron? Well, we're working on building Ceph appliances, and there are lots of operations we want to do in Ceph which are computationally expensive; compression, erasure coding, and encryption are all examples of this. So we're working on defining some of these algorithms in hardware, so that we can use FPGAs to do the heavy lifting for us and keep the CPU free for Ceph itself.

So yeah, that's my talk. Any questions? No mathematicians in the room, I guess.

So the question is: which combinations of K and M are best, and how do they affect performance within Ceph? Typically we see people using K=6 with M=3, or nine and four; those are the most commonly used combinations. We're only implementing a few different combinations on our FPGAs for that exact reason, because each one has to be defined differently in hardware. So I think there's a standard set that gets used. And yes, if you change the proportion of M, the calculation can get much more expensive, so you want to use a ratio of about three to one, I think.

"I imagine there's network latency?" Yes, there is. So the question is: how does the distributed nature of Ceph affect acceleration? There is heavy network latency in Ceph, and that means that in some cases the cost of recovering the data in time might still be far less than the expense of the replication over the network anyway. So you have to strike a bit of a cost balance and figure out in which cases it's worth accelerating the process and in which cases it isn't. It's a trade-off: in some cases it's worth it, in some cases it isn't, and that's a very important factor in deciding whether or not to accelerate. Does that answer your question? Cool.

Okay. Well, thank you very much.