All right. Good afternoon. It's great to be here. I love a conference named Uptime. I guess you could argue that a conference named Downtime would be even more appropriate for me, but I'll take Uptime. It's really been interesting to watch all of these talks today. And in particular, I really enjoyed... if the clicker is going to work. Probably not, because it's not plugged in, so we'll just do it old-school.

I thought Craig's talk was really interesting. Craig talked about all these great attributes that we have with Kubernetes-orchestrated things, and they're beautiful: immutability, recoverability, and so on. But underlying them is a foundational assumption that these services are essentially stateless. And we love statelessness. I love statelessness. Don't catch me being pejorative about statelessness, because I love it. It's like the ability to model a problem as a finite state machine: if you have the ability to do that, you should seize it, because you can make code that's totally correct, and it's such a relief to be able to do that. So: stateless, stateless, yay statelessness.

Except the world is not stateless. In fact, even stateless things aren't stateless, right? When we talk about these stateless components, they actually have in-kernel state. They've got connection state. They've got active state. But it's transient state; it's state that we can drain out of them. Sadly, there's another, darker kind of state in the world that we all depend on: persistent state, the stuff that's actually going to sit on a disk somewhere, on non-volatile storage somewhere. And we've actually done, I think, a very good job of separating our concerns and making sure that those of us who are working on stateful services don't get to come to conferences where we get to talk about statelessness.
So we don't get... it's like, "Wait a minute, I want immutability, restartability, recoverability." It's, "Quiet, you. Back to your stateful service, please." And we've done a good job of separating that, and it's important, because statelessness does allow us so much, and so much of the new code that we develop can be or should be stateless. So that's a good thing.

But this persistent state does exist, and the data path is a dark and terrible place. For purposes of my definition, the data path consists of the software, the hardware, and the firmware that connects that stateful service endpoint all the way down to the actual non-volatile medium that stores that state. There's some non-volatile physical medium, which for right now is flash or magnetic media. Don't believe the death of disk, by the way. I've been around long enough to have heard the death of tape for several decades. Disk is still very much with us; we'll talk about disks in great detail. But those bits are going to land on that non-volatile medium, and the data path is that entire path, all the way back. In many ways the data path is a journey back through time: as you get closer and closer to the physical medium, you get closer and closer to DOS. That's a true statement, sadly.

And the systems that actually deliver this data path to us are themselves very complicated. These are big, complicated distributed systems built on top of these non-volatile media, and we need this data path to work. We actually have great demands on the data path: in order to have this beautiful statelessness that you all enjoy, the stateful path, the persistent path, has to work all of the time. It's really not okay for that path to be down at all. We demand perfection of it. On the one hand, having to deliver that perfection can feel really
difficult, because the demands feel so acute. On the other hand, if we can't actually rely on that persistence, it becomes really hard to build infrastructure elsewhere. So we really need this to work all of the time. We demand that it is consistent, that it is available, and that it is partition tolerant. That is what we demand of that stateful layer.

Sadly, we know from Brewer's theorem that we can't actually have all of those things. I knew for a fact that I would not be the first talk to mention the CAP theorem. Bridget kicked us off this morning talking about CAP, and Bridget pointed out what Caitie McCaffrey and others have also observed. "All right: consistency, availability, or partition tolerance, pick two." You're like, "I know, I know, I know! I want consistency and availability." It's like, "No, no, no, put your hand down. You have to pick partition tolerance." "But I do not want partition tolerance." "I know you don't want it. You have to take it." So it's actually a choice between consistency and availability. "But I don't have partitions." "No, no, you're having one right now."

So it is important that we don't get to not pick partition tolerance. We have to pick partition tolerance, and the question is what trade-offs we're going to make. Now, it can be easy to be like, "Okay, CAP theorem, original sin."
"Okay, screw it. Let's all just get stoned, because life is meaningless and we can't actually do anything." Watching this later is going to be such a treat for me, by the way. But we shouldn't use the CAP theorem as an excuse to give up on humanity. There's actually a very interesting paper that came out of Google later explaining: okay, it is true that you have to pick two of these, and you have to pick partition tolerance. But it's also true that if you engineer the system very carefully, you actually can engineer partitions away. "I knew it! I knew it! I pick consistency and availability!" No, no, easy, easy, easy, because it's exceedingly difficult to engineer partitions away. It takes a very long period of time, and that's what we're going to talk about in this talk. We're going to talk about the difference between a system that has theoretically made CAP trade-offs and one that is actually resilient, in which we have done the best we possibly can to engineer these things away. The difference between those two systems, by the way, is the zebras.

What are the zebras? I'm the son of a physician, which is going to become relevant in just a second. If you are a med school dropout, or a physician yourself, or you just like following medicine, you may have heard the term "zebra." It's medical slang, and it denotes a condition that is rare and exotic but can be confused with something much more common. Med students and residents are very prone to want to diagnose the most exotic thing you could possibly have. It's like, "Oh, you're sick after the conference? Cerebral malaria."
"It must be that." It's like, well, okay, maybe not. I may be running a fever, and there could be other symptoms that you would share with cerebral malaria, but let's not jump to cerebral malaria. Hence this aphorism, which was coined by a physician in Maryland in the 40s: "When you hear hoofbeats, think of horses, not zebras."

For a hospital this makes sense, because when you solve an ailment, it doesn't prevent someone else from having that same ailment. Sadly, it's not software; you can't just push a fix for cerebral malaria. So it makes sense to be constantly aware of what is common, what is likely, and not go to the most unlikely thing. That said, the soft underbelly of medicine is that zebras do exist, and if you have had one of these yourself, you know the suffering involved when you actually do have one of these rare, exotic conditions. When you do have cerebral malaria, by the way: if you ever go to a malarial region, come back to the United States, and you're ill six months later, be sure to volunteer to your physician that you've been in a malarial region. That's a very important data point, because otherwise you will suffer for a long time.

And I had one of these close to home. As I mentioned, my father was a physician, an emergency medical physician. I used to think growing up that being a physician was the easiest job on earth, because you just look very thoughtfully at this injured child and then say, "You'll be fine." And if it's really serious, I mean if you're obviously bleeding everywhere and this is clearly a medical emergency and you really have to dig much deeper, you simply say, "We'll keep an eye on it." So that, to me, was being a physician. This is a pretty easy gig: "You'll be fine" and "Let's keep an eye on it." I actually had a friend in college, and she had an ailment. She's like, "Hey, would you mind calling your dad and asking him what's going on?"
I'm like, "Oh, you can call my dad, but I'll just save you the time: you'll be fine." That's easy. She's like, "You know, I actually want a real medical opinion." I'm like, "All right, then: we'll keep an eye on it. There you go, we're done. You don't need more of a medical opinion than that." But she prevailed on me to call my father. My father spoke with her for a while, the phone was handed back to me, and my dad said to me something I had never heard before. He says, "Yeah, Bryan, she's got to be seen." I'm like, "Oh my God, you're going to die." That's the morgue that you go to, as far as I'm concerned.

But why did my father say that? Because she was suffering from abdominal pain. And just for your life lesson, this may be the most important thing you take out of this talk: abdominal pain is something that you don't want to screw around with. If you're suffering from abdominal pain, there could be a lot that is wrong with you, and you should call an advice nurse or present yourself to a physician, because unlike the scrapes and bruises that a child has, abdominal pain can be very serious.

And my sister had very acute abdominal pain. My sister was on her way to Ghana with her boyfriend, now husband. They had a layover in London, and she had such excruciating belly pain that she could not get on the flight. They went to the hospital, determined that she had appendicitis, and gave her a laparoscopic appendectomy in the UK. Totally changed her position on socialized medicine, by the way. It's a very interesting kind of thing. They also discovered, as an aside, that she has what's called a pseudocholinesterase deficiency. And I know we've already talked about pseudocholinesterase, so she's going to crush it. So my sister has... And now you can actually go, go ahead.
I think I'm on Spaceballs. So my sister had a pseudocholinesterase deficiency, which means that she could not metabolize the anesthesia that she had. When they extubated her, she stopped breathing, and because there were different standards of care where she was, they didn't notice that she got blue in the face. I encourage you to do what I do... we'll be back to that in a second. Yeah, exactly, a little preview.

I encourage you to do what I do, and I know she won't mind: claim my sister as your own sister the next time you're having surgery. Say, "You know, it's funny. I don't have any allergies to medicine that I'm aware of, but I do have a sister with a pseudocholinesterase deficiency," and all of a sudden watch your anesthesiologist wake up. "Wait, what did you say? Oh, okay. I've got to get the red bracelet out." So they get the red bracelet out, you get the red bracelet on you, and the red bracelet is like, "Hey everybody, time to wake up so you don't kill me." Which of course everybody should have going into surgery, but you only get one of these if you have a sister with a pseudocholinesterase deficiency. So adopt her as your own.

So we're done: laparoscopic appendectomy, we're done. Then she's at a wedding in California and she has the same acute abdominal pain, this time much worse. She goes to the hospital again. They CT her. They've still got no idea what's going on. In fact, they believe that they've found something in her abdomen, and they're wondering if she's into anything weird. My sister's not into anything weird, so she's just like, "What are you even talking about?"
They didn't actually know what was going on, and my father at this point had lost his patience and said another thing I had never heard him say, but which is definitely exciting when your father's a physician. He said, "You need to give her an exploratory laparotomy, or I'm going to give her an exploratory laparotomy." Now, I had had my father give me stitches on the kitchen table, but never an exploratory laparotomy. That was going to be cool, and I'm like, "Come on, Dad, give her the exploratory laparotomy." But they did it. They cut her open, and what they found is emphatically a zebra.

This was a piece of my sister's gut from the time that she was conceived. This is something called a Meckel's diverticulum, and the Meckel's diverticulum is a little umbilicus that you have before your true umbilicus forms. This is like POST for your umbilicus, POST being the power-on self-test that runs before the BIOS. This is the thing for which the BIOS is the really high-level software. And this is obliterated after seven weeks of gestation, so you're very much in utero. For approximately two percent of people, it's not completely obliterated, and there's a little eddy in your gut. For approximately two percent of those people, it'll become symptomatic: they'll develop this thing called Meckel's diverticulitis. Essentially all of those cases happen under the age of two. My sister was presenting as a 29-year-old with Meckel's diverticulitis. These are some of the largest Meckel's stones ever pulled out of anybody, because these had been sitting in her body since she was a kid. And it explained so much. All along I thought she was constantly complaining about aches and pains growing up; it's because she was walking around with a bag of gravel in her gut. And my sister does not have a lot... I mean, she and I, she is
my closest living relative; we're basically a bag of bones. There's not a lot of extra space for a bag of gravel, but you can see how large this is. The point is that this is something that took a long time to diagnose and was misdiagnosed a lot. It doesn't happen frequently, but when it does happen, it can be really, really debilitating.

Let's talk about zebras in the data path, because we very much do have zebras in the data path. We've got all these different software components. And software, if you're having a bad day and you need me to remind you of this, is really amazing. Software is unlike anything we have ever done. Software is its own thing, because software has this incredible paradox in that it is both information and machine. It has all these properties of a machine; today we've been talking about all these mechanical properties of software. But it's not a machine. It's information, and it lives like information.

I know we like to think that everything is broken all the time and everything was written last night. That's actually not the case, and it's not the case because software that functions survives in perpetuity and never raises its hand to tell you anything. It just works. And we have lots and lots of software in that great big stack that dates back to the very dawn of software. We've got lots of software that just works, and that's the great news: when it works correctly, it survives in perpetuity. The problem is that it's very expensive, like impossible, to rewrite all of that. We wouldn't want to rewrite all that, because it took us so long to get it right. What that means is that the bugs that are left are nasty. The horses have all been found; the horses have been found over decades. The only things left are the zebras. The good news is that when we find a zebra, we can eradicate that zebra. The bad news is that it's only zebras that
are left.

So how do we actually go hunt zebra? Let's go zebra hunting. And I don't mind picking on zebras too much, by the way. You're like, "I love zebras." You don't actually love zebras; zebras have got a really foul temper. So I don't mind picking on the zebra. I'm not going to actually go hunting a zebra, but I'm happy to metaphorically hunt zebra.

In particular, when we're hunting these things, we should never assume that a problem is reproducible. If we can reproduce a problem out of production, great, but that should never be our going-in assumption. Our going-in assumption should always be that we need to debug it in situ, in production. We can't say, "Well, let's go reproduce it somewhere else." No, we're not going to try to reproduce it somewhere else, because this is an extremely strange pathology that's going on. You're not going to recreate my sister's condition in anybody else, because it is so unusual. So many of these conditions are very, very unusual.

And when we have one of these unusual conditions, we can be especially... I put "bias for action" in quotes here because I'm making a reference to Amazon's principles, which troll me into a stupor. Amazon's leadership principles include "bias for action," and okay, tell me more. When we are overly biased for action, we can be biased for rash action. It's easy to be biased for panic, it's easy to be biased for haste, and we don't want to take the rash action. In particular, restarting a component, killing a component, is the wrong first motion for something that's misbehaving in production. I really enjoyed Jason's presentation where he had that full story, but I did wince a little bit at "we don't actually know what Greg did to this." Greg took the process out back and did something terrible to it, and now the service is fine.
It's like, "Yes, but what happened to that kid?" "We don't talk about that process anymore." It's like, okay, it just disappeared.

Because you want to actually understand. Jason didn't fill in that gap, so I'm going to assume that Greg saw that the process was behaving pathologically, took all of the information that he would need to debug it later, and then restarted it. Of course I'm going to assume the best of Greg. But when we have a service that is misbehaving, it is the wrong bias to immediately restart it. The first bias should be not to change the system, but to observe it. Ask questions. And again, I really enjoyed Jason's full timeline, because you appreciate what a tiny fraction of that is actually on the system. I think this is true for many outages, where you've got all of this work beforehand actually getting the right person on the keyboard. You don't need to save yourself 20 seconds right now. If you can get enough information to figure out what's going on before you fix the problem, please do so. So your first bias shouldn't be to change the system, but to observe it.

And by the way, don't go to a hypothesis first; ask a question first. Debugging is the process of asking questions and getting answers, not necessarily formulating hypotheses. A hypothesis should happen very, very late in the process, which may be seconds later, but it should happen when the questions and answers have constrained the hypothesis space so much that you basically know what it is.
You know what is consistent with all the data. So do not jump to a conclusion. If there's one thing that frustrates me, it's when people immediately restart this, restart that, reboot. That's like, whoa, whoa, whoa: we could be missing tons of opportunity to find deeper, more systemic issues. The process that's spinning that we're about to reboot: maybe it's spinning because it can't talk to some other service. Maybe it's spinning because it can't talk to some other service that it isn't even supposed to be talking to. You can discover an amazing amount when the system is misbehaving if you ask questions to get answers.

But it means that we have to be able to observe the system. The observability of the system is truly paramount. For us at Joyent, observability, I think it's fair to say, is our organizing principle. It is more than anything else what bonds us as engineers across the company: we believe that systems should be observable. And that's part of the reason we've made some of the choices that we've made. That's what my shirt is about: "There's no place like usr/src/uts." And if you're thinking I'm missing a leading forward slash, I'm not: usr/src/uts is the source base for illumos, which inherits the Solaris heritage, which goes all the way back to Unix, and SmartOS is our derivative of that. We've got our own derivative of illumos because it's designed for observing the system, around ZFS and DTrace and zones.

So observability is very, very important to us. We developed Manta, and Manta is our S3-like service, although it's container-centric: in addition to being able to put an object and get an object, you can actually spin up a container where an object lives and save yourself having to move that object somewhere else. So it's our container-centric object store. It's got ZFS at its core. We use sharded Postgres. We use ZK for leader election.
You're like, "Oh, ZooKeeper." Okay, if you just threw up in your mouth a little bit: it was actually 2012 when we designed this thing. So if someone, in addition to inventing Raft, could also invent the time machine that they can shove Raft into, that would be awesome. Meanwhile, we're stuck with ZK. And actually ZK, like all things ZK: once you get it running, it behaves all right. It's the getting it running that is a real challenge with ZooKeeper.

All of our services are primarily Node.js. And if you think, "Wait a minute, JavaScript? I thought you said you're about observability and debuggability": we are, and we've spent a lot of time on the observability and debuggability of JavaScript and Node.js. The reason we still use Node is that Node is actually more observable for us than any other platform other than C. We've used it over and over again: the ability to gcore a running Node process, restart that Node process such that service is restored, and then later go on to understand: why are we using so much memory? Not "why is GC running so much?" Right? It's like, "GC is my problem." Actually, your problem is that you're using too much memory, and GC is trying to help you out, but it can't find your garbage because you have too much crap in your house. So help GC out. And we've used it over and over again to be able to DTrace these things, use MDB, and so on. So observability is very, very important to us.

And as you may or may not be aware, Joyent was bought by Samsung. We were bought by Samsung a year ago, because Samsung wanted to build their own private cloud. They saw Manta.
They saw Triton, which is our also open source system for container management and cloud management, and they realized that they need to control their own fate with their own physical computers. I'm one of the renegades who believes that Jeff Bezos is not going to own every physical computer in the arbitrary limit. I know it's a tough assumption, but just go with me for a second. So Samsung bought Joyent, and as a result the scale we're now seeing with Manta is Samsung scale, which is a whole different level of scale. The M's all turn into B's when you're dealing with Samsung. This is a very, very large company with a very, very large footprint that is very assertive about getting all of their stuff onto Manta and Triton. It's very exciting.

The good news is that between the years of production we've had on Manta, all of the observability, and the hyperscale post-Samsung, we've nailed some really thorny stuff in Manta. Feels great. Here's the bad news: in our stack, and in every other data path that's out there, we still have components that are really problematic for us, that cause us an enormous amount of downtime, if not in the aggregate then at least in the small, and that are very, very hard to resolve. "Well, why don't you just rewrite those pieces of software?" I hear you asking, and I've had those same desires. The problem is that there is a zebra sanctuary in our stack, deep, deep in the stack, and this is where you get to not just metaphorical DOS but literal DOS. There are these little proprietary components that sit so close to the hardware that they are helping the hardware lie to the software about what it actually is. What is this terrible software?
This is firmware. Right now humanity is engaged in a silent war. It is us versus firmware, and you all need to choose which side you're going to be on. The problem with firmware is that it operates so silently, so deep in the stack. When you take out a motherboard, a disk controller, a NIC, you see all these controllers on there that are all running little operating systems, that all have their own little firmware bits, that are all being delivered out of the home directory of someone who no longer works for the company, because it's so screwed up there. When you're looking at it, you're like, "I am looking at source code that they've lost right now. I can feel it. It's warm to the touch."

And the problem is that firmware that operates silently also fails implicitly. The firmware, by the way, is not wired up to your beautiful monitoring system. The firmware is not going to page you; that's way too much work for the firmware. It's like, "Well, I could page the operators, or I could just not do this. Actually, I'm just not going to go to school. Let's try that." This is like my 12-year-old's experiment with math tests: "What if I don't go to school?" It doesn't work. What are you, firmware? Out of my house!

So let's go through some of these various dark bits. We'll start with the spindle. I did love what someone said earlier about how these aren't like simple mechanical systems. "Simple mechanical systems," okay. The disk, rotating magnetic media, is a mechanical marvel. Would anyone guess the fly height of the head on a drive? Does anyone know the fly height of a drive? Any guesses?
Five nanometers: an excellent guess. If I didn't know better, I would say five nanometers seems a little aggressive; I would go more like maybe ten nanometers. It's 0.8 nanometers. I was talking to an exec at a disk drive company when he said that, and all of us were just like, "Whoa." I'm like, "You mean 800 picometers?" He's like, "Yeah, I guess I do mean 800 picometers." This is an extremely small amount of space, 800 picometers. Um... "Gibberish." That's useful. My kids are going to find this incredibly useful when they see this.

These things fly so low that, honestly, if the head hits a particle of smoke, that can crash the head. That is how finely machined they are. It is incredible. Any particulates in there, magnetic dispersion, certainly wear, certainly vibration, certainly temperature: all of these things can make the disk fail, this amazing mechanical marvel.

But the disk actually knows this. Amazingly enough, the disk is aware of the fact that it is cheating physics all the time. It has this incredible map: "Oh, okay, we must have hit a particle of smoke there. Okay, don't go there." It has this map of the drive, and it knows where it can go and where it can't. It's storing your data many times redundantly on that drive, and it is able to reassemble it. The drive is incredibly sophisticated, which is amazing. But that's also a lot of software, and it leaves some much nastier failure modes: namely, software-based failure modes. Disks absolutely can read and write the wrong data. You're like, "How is it that you can have a fly height of 0.8 nanometers and yet you're giving me the wrong block address?" It's like, "Well, that's hardware versus software. I don't know. What do you want me to tell you?"
A disk absolutely can return the wrong data, and it can write the wrong data. We saw this reality coming in the early 2000s, and ZFS is very much designed around full-path data integrity, such that ZFS knows what it wrote where and can verify it when it's pulled off of the spindle. I don't know why anybody at this point, especially given that it's open source, would rely on anything other than ZFS to store their data, but it's your data. So ZFS has been huge for that, and we've discovered all sorts of data corruption.

I do love that ZFS will discover data corruption in things that are "too expensive to have data corruption." You realize that doesn't follow; it doesn't actually matter how much you paid for this. "No, but I paid way too much money for this SAN to have data corruption." Well, it has data corruption. It turns out that when you're paying that much money, what you're actually paying for is the right to have an executive VP lie in your office and weep for forgiveness. That's the actual service that you're buying. It's like, "Oh, yeah, no, we get that all the time, but where should I send the EVP? Maximal tears, small tears, what do you want?" That's what you're actually buying.

So we saw all sorts of crazy data corruption. We saw data corruption in things that we felt couldn't possibly have data corruption. We saw data corruption in a controller where the last 64 bytes of many kinds of pages would simply not be DMA'd out, and there were millions and millions of this controller. We spent a lot of time debugging ZFS before we came to the conclusion that it was not ZFS: ZFS was accurately reporting that the data was wrong.
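The idea of knowing what was written where, and verifying it on the way back, can be illustrated with a small Python sketch. This is not ZFS's code or on-disk format; the class names (SilentlyBadDisk, ChecksummedStore) and the 64-byte tail loss are made up for illustration, loosely modeled on the controller story above. The point it tries to show is that the checksum lives with the pointer to the block rather than in the block itself, so a device that silently returns the wrong bytes gets caught on read.

```python
import hashlib


class SilentlyBadDisk:
    """Hypothetical block device that can lose the tail of a write without any error."""

    def __init__(self, drop_tail_bytes=0):
        self.blocks = {}
        self.drop_tail_bytes = drop_tail_bytes

    def write(self, addr, data):
        if self.drop_tail_bytes:
            # Simulate a controller that never transfers the last N bytes:
            # the tail of the block ends up as zeros instead of the real data.
            keep = len(data) - self.drop_tail_bytes
            data = data[:keep] + bytes(self.drop_tail_bytes)
        self.blocks[addr] = data

    def read(self, addr):
        return self.blocks[addr]


class ChecksummedStore:
    """Keeps each block's checksum in a separate pointer table, not in the block."""

    def __init__(self, disk):
        self.disk = disk
        self.pointers = {}  # name -> (block address, expected SHA-256 digest)
        self.next_addr = 0

    def put(self, name, data):
        addr = self.next_addr
        self.next_addr += 1
        self.disk.write(addr, data)
        # The digest is computed from what we intended to write.
        self.pointers[name] = (addr, hashlib.sha256(data).digest())

    def get(self, name):
        addr, expected = self.pointers[name]
        data = self.disk.read(addr)
        if hashlib.sha256(data).digest() != expected:
            raise IOError(f"checksum mismatch reading block {addr} for {name!r}")
        return data


if __name__ == "__main__":
    store = ChecksummedStore(SilentlyBadDisk(drop_tail_bytes=64))
    store.put("page", b"\xab" * 4096)
    try:
        store.get("page")
    except IOError as err:
        print("detected:", err)
```

With a healthy disk the read simply returns the data. With the lossy one, the store reports the mismatch instead of handing back bad bytes, which is the behavior that made it possible to conclude the storage layer was accurately reporting that the data was wrong.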
So it's been amazing. But even ZFS oversimplified the failure modes of disks. Disks have got lots of ways to fail that are not simply reading and writing the wrong data. In particular, you know that when I have retained a firmware rev, it's bad news. And the Seagate Barracuda su0d: I will meet you in the afterlife. This was a firmware that had a logic error in the software that controls the head. It misapplied the polarity of the head, such that instead of decelerating the head when it got to a high logical block address, it would attempt to accelerate the head. The disk itself, the JTAG on the actual spindle, would bounce the spindle to prevent you from actually destroying the drive, and what you would see is this strange 558-millisecond outlier. What is that? That is the time it took that drive to reboot. And because you would see it at a high logical block address, you would have these arrays of storage that were fine and happy and happy and happy, and then sad, sad, sad, massive sadness everywhere. Very, very frustrating when these things are not doing what you feel they should be doing. Those are the spindles.
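One way to ask that kind of question of a system is to group I/O latencies by where on the disk they landed. The following Python sketch is only an illustration under stated assumptions: the trace data is invented, and the function name, the region size, and the "ten times the median" threshold are hypothetical choices, not anything from the tooling described in the talk.

```python
from collections import defaultdict
from statistics import median


def outliers_by_lba_region(samples, region_size, factor=10.0):
    """Group (lba, latency_ms) samples by LBA region and count slow I/Os.

    An I/O counts as an outlier when its latency exceeds `factor` times the
    median latency of all samples. Returns {region_start_lba: outlier_count}.
    """
    typical = median(latency for _, latency in samples)
    counts = defaultdict(int)
    for lba, latency in samples:
        if latency > factor * typical:
            counts[(lba // region_size) * region_size] += 1
    return dict(counts)


# Made-up trace: most I/Os take a few milliseconds, but I/Os to high
# block addresses occasionally stall for about half a second.
trace = [(lba, 4.0) for lba in range(0, 1_000_000, 10_000)]
trace += [(990_000, 558.0), (995_000, 558.0), (20_000, 5.0)]

if __name__ == "__main__":
    print(outliers_by_lba_region(trace, region_size=100_000))
```

On this invented trace, both half-second stalls fall in the highest region, which is the shape of evidence that would point at something address-dependent in the drive rather than at the software above it.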
I think with flash, we were very early in flash. So at Sun, along with Adam Leventhal and others, we did develop one of the very first arrays to really use flash, and we saw inside that sausage factory a little too much. With flash we saw this entire operating system in the actual SSD that is responsible for wear leveling and all the scheduling and so on, so I have been, and still am, very concerned about SSD failure, because there's so much that can fail. Ironically, because we've been so concerned about those failures, we at Joyent, and even at Sun back at the time, haven't really suffered serious flash problems, because we overengineered the heck out of our SSDs. So we have SSDs that allow way more drive writes per day. In fact, we want so many drive writes per day that they're like, "We don't actually make one like that." You're like, well, go find a way, because we want to have very reliable SSDs. Because I've always been concerned about (if it's not a word, I fear it will become one) the flashtastrophe of having many SSDs fail at once in the same way, which is completely conceivable given the complexity of software running in the SSDs, the complexity of firmware in the SSDs. The HBA, the host bus adapter, is the thing that actually takes an IO from the operating system and actually sends that off to a disk. How complicated could that possibly be? Glad you asked: excruciatingly complicated. Everything can go wrong with this. There can be many bits of firmware. You can have something called a SAS expander that is actually sitting between the HBA and the disks. There's all sorts of ways to make this excruciatingly complicated, and this HBA firmware loves to just drop IO when it's under load. I mean, okay, look, everyone loves this. I would love this, the ability to just drop work when you're under load. There's only one body of software that can get away with that, and that's the TCP/IP stack. It is actually funny when you look at the
code. We're just like, wait a minute. You're in a condition that you don't like, and literally the comment is like, "Really complicated condition to handle. Free message." It's like, what? It's like, "It's fine. If it's important, they'll resend it." It's like, whoa, you can do that? That's like me and my mail. It's like, well, I haven't opened this for six months and I don't seem to be arrested, so all right. Wow, you can do that. I didn't know you could do that. So the networking code gets to get away with that. Of course, that engineer is inducing some hair-pulling latency bubble for some other engineer thousands of miles and perhaps thousands of years away, but it is what it is. The HBA, however, can't do this. The HBA is not allowed to just decide it doesn't want to come to work in the morning. It actually needs to do IO. And so what you will see is these systems that are running along, barreling along, and all of a sudden they'll grind to a halt because one IO went MIA. And you say, wait a minute, how can one IO
possibly cause the entire system to dogpile behind it? Glad you asked. Because in a complicated system, it's really easy to have these implicit interdependencies, where that one IO is the uberblock that ZFS needs to write out. We need that IO to complete or not complete; we don't care which. Either come to work or don't come to work, but you need to tell me either way, right? You can't actually just go missing. And these become latency outliers: ZFS will retry it, the IO stack will retry it, so you'll just see a latency outlier. We actually saw one of these recently where it was so far up the stack that we were trying to figure out why we were getting all these RSTs, because TCP is like, "What's going on with that guy?" and will actually RST the connection. It turns out they hit a latency outlier with the HBA. So these things can be really serious, and of course that was solved with a firmware update. By the way, I needed no additional substance for this talk, but just because the universe loves to biblically punish me in this way, when I conceived of this talk I had a conga line of broken firmware that made its way to me. We had all sorts of issues just in the last couple of months around this. Another one is the DIMM, so the DRAM. We love DRAM. I love, love, love DRAM, and part of the reason that SSDs actually did not quite take off the way many of us thought they would is that DRAM is so great. It's so fast. It's so cheap. I love DRAM, until it fails. And DRAM can fail for a bunch of reasons that have to do with the hardware. You can have corrosion, you can have humidity, you can have a bunch of environmental factors. But one of the challenges we have with DRAM is that as the speeds have increased and the voltages have dropped, we are seeing more transient errors on DRAM.
We are seeing more signal integrity issues. We are seeing more issues where the box simply resets with an uncorrectable error. And one thing that is very maddening is that you have these boxes that see an uncorrectable error and never saw a correctable error. It's like, how is that possible? Physically, if we're gonna have one of these issues, you expect to see correctables before the uncorrectable. What happened to the correctable error? "Oh, we sent that to the firmware," with (and I'm not making this up, and they actually capitalized it this way) the "Firmware First" model. It's like, is Donald Trump actually in my firmware right now? This is like "Firmware First." I've got, like, the firmware marching through my machine. It's like, no. Firmware first? No, I'm sorry. No, no, no. And I will be unequivocal about this: there were not errors on both sides in this case. I want to be unequivocal. Firmware first is the wrong model for error handling. And so, honestly, not a joke: there's something called the CMCI, the corrected machine check interrupt, which allows the operating system to know that a bunch of correctables are going on. And I'm talking to the vendor, and they're like, "Yeah, we don't do that. We've changed the BIOS and we've added a feature called cloaking." I'm like, you're confessing a crime to me right now, basically. You should call your lawyer. It's like, where do I turn off cloaking? "Oh no, you can't turn off cloaking." Where did you tell me about cloaking? "We don't tell you about cloaking." It's like, okay, this is actually illegal. And what actually happens is: why can you not give me an interrupt when you see a correctable error? "Oh, because we would give you too many interrupts." You'd give me too many interrupts?
Okay, how many errors do we see on DRAM? Like, all the time, all the time, all the time. On some of these calls I have half-expected these guys to ask, "Can I come work with you? Please, can you get me out of here, because nothing works over here." It is actually terrifying, the rates of failure that we are seeing, and I believe we are seeing right now a silent epidemic of DIMM failure. And so the next time your EC2 instance just kind of mysteriously disappears to come back, you can wonder if it wasn't firmware first and your app last in terms of what actually happened. In terms of the chassis, you would think, okay, the chassis surely has to be immune from software. I'm like, well, I'm glad you think that; it's very cute that you think that. Unfortunately, the chassis is also managed by a bunch of software. There's a system controller on the chassis, and that system controller is often responsible for managing the fans, and if those fans are mismanaged they can actually do damage to the box. One of the amusing errors we had in years past was that the system controller itself ran out of memory. It ran out of its own memory such that if you told it to reset itself, it would tell you that it doesn't know who it is. It's like, "I do not know what you're talking about." I'm talking about you. "You're right, you're right. That's all right, never mind." And when that happened, the fans would run full speed, and when that happened, the spindles started getting all sorts of latency. We had spindles on there that had all sorts of latency issues. So vibration can be serious, and you're gonna wear those things out. So you can actually have zebras in the chassis as well. You can have zebras in the NIC. There's a ton of firmware on the network interface card. There are optics that connect that network interface card to the top-of-rack switch. All those things can fail, and they actually do, and it becomes very tempting to say, like, okay, the optic is gonna fail.
Let's use link aggregation. Let's actually have two links together and use them as one. Like, okay, that sounds great. It feels great. Now we're gonna be redundant, except we're also a lot more complicated. And hopefully one of the themes that you take away from this is that the world is actually very, very complicated, too complicated as it is right now, and we need to be very mindful about making it more complicated in the name of availability, because that complexity will cut against the very availability that we're trying to deliver. And in the case of LACP, you've got something else called MLAG. MLAG is the thing where the two switches actually need to talk to one another to make sense of these two links that are talking to them, and that is like the last code path to be verified by switch vendors. We have found so many switch bugs in MLAG. So, very, very frustrating there. Which leads to the top-of-rack switch. If you, as firmware, wish to strike a blow against humanity, the place to start is the firmware in the top-of-rack switch, because the blast radius of a firmware mistake in the top-of-rack switch is enormous. So it actually was very helpful for us when we were doing a PSC, the due diligence before Samsung actually bought Joyent, and they had a bunch of hardware. They wanted us to run on hardware that they had found in a dumpster, I'm quite certain, which was actually very useful because we saw all sorts of new failure modes in the software. This thing just could not hold on to ARP tables to save its mind. And so it would constantly just chuck all of its ARP tables, and the entire system would go split-brain as everyone tries to figure out what's going on, and all this upstack software gets very confused. We had another switch, not that one, another switch which liked to DDoS the system as kind of a hobby, like
when it wasn't actually routing traffic. And in particular, you could send it a single malformed packet and it would begin to broadcast that malformed packet to everybody in perpetuity. It's like, thank you very much. And that bug was so bad the vendor actually apologized, something that very rarely happens. And it is zebras all the way up, right? We were talking about the very lowest bits of the stack, but these things have manifestations way, way up the stack, and of course software gets it wrong all the time. It's not necessarily software versus firmware. Remember, it's firmware versus humanity, not software versus firmware. The software gets it wrong too, but the software that we can see, that's open source, we can actually do something with. I mean, honestly, the person that may have most single-handedly advanced the quality of the software that we rely on in the data path may well be Kyle Kingsbury with Jepsen. So Jepsen is a software suite that takes these distributed systems and checks to see if the system does what it claims that it does, and very frequently it doesn't. They're not linearizable, they're not serializable, they get data out of order.
They have all sorts of pathologies. Unfortunately, it's gotten a little bit boring now, because everybody knows to do their Jepsen run before they announce that they've solved all problems. It used to be, a couple of years ago, that people would announce that they'd solved all problems and then you'd do the Jepsen run, which is a lot more exciting because you discover all sorts of pathologies. But we can do that when we're upstack. You can do that on RethinkDB. RethinkDB, by the way, not dead. I just want to say that the CNCF and the Linux Foundation bought the source code for RethinkDB; very much alive. But you can do that for RethinkDB, you can do that for these databases, and have some assurance. There is no Jepsen for firmware. Firmware is, like, its own Jepsen, I guess. I don't even know; it is its own chaos monkey. But that's part of the challenge we have, and one of the things that I see, that I am very concerned about, is that these components that we build on are getting less reliable over time. And the fact that we have software that can deal with unreliable hardware is not an excuse for unreliable hardware. And by the way, the hyperscale folks have figured this out. You know, Google loves to have the kind of beautiful myth (I mean, it's a fact, but it's also a myth) about the motherboards that are just shoved in with Velcro, right?
And you know, you walk into Google and you see (and this is the very first deployments at Google) they just bought the cheapest motherboards they could find. They would do anything to save a nickel. And they just had Velcro on the side of the motherboards and just shoved them into a rack, just motherboards, and it feels great. But it's a terrible idea, because you can't actually build reliable infrastructure on that kind of quicksand. You actually need to have a much more stable foundation. So reliable software should not be an excuse for unreliable components. And then, a little Blue Öyster Cult reference: I would encourage people to not fear the zebra. On the one hand, the data path should not be undertaken lightly. One thing that I do find a little frustrating is when people endeavor to walk the trail of the data path without realizing that you're not going backpacking, you're going into a war zone, and if you don't have your Kalashnikov packed, you don't appreciate the severity of the situation. So when we are going into the data path, we should not be deploying brand-new software to the data path. And by the way, when that software is early on in its life, it should never be giving you the wrong data. It will have other failure modes. You know, my data was some of the very first data that was on ZFS. ZFS didn't give you the wrong data. The difference between ZFS in 2003 and ZFS in 2017 is the ability to deal with those much more pathological device issues where devices simply go away. So these things should always give you the right data when they start off, and one should not undertake it lightly. We need to enshrine observability. That's a theme that I've had for a long time; great to hear so many others having that same theme. We need to reward complete understanding, not merely resolution. "I made the problem go away." Great. What was the problem?
Let's make sure that we're asking the questions, that we've pulled on all the open threads. And as long as it's unobservable, which it is for the moment, firmware actually is the enemy, and we need open firmware. I don't know if we're ever gonna get there. To me, the post-singularity rapture will be open source firmware. You know, for many people it's Bitcoin and everything else; for me it will be open source firmware. And as long as we are open source, we actually do have a quality ratchet. We can make this thing better all the time. So just in closing, I want to give you a couple of things to look at in case I've made you too pessimistic about firmware. Well, that'd be good; that'd be mission accomplished. But if you don't know her, you should check out Micah Elizabeth Scott's amazing, amazing teardowns of these devices, where she goes through all of the firmware that's in these things and reverse engineers them so she can hijack their functionality to do something much more interesting than people intended. So she does things like take the firmware image and actually pull it up in Photoshop, and she can use that to find out where the bootloader is. It's actually amazing. And it's so much more; like, I just want the IO to be done. I am very boring compared to what's, anyway. That stuff is great. And all of our RFDs, requests for discussion, are all public; you can go check out all that to understand what we're actually doing. I if you're curious with an OJS, but I am going to send you today, but she goes talk. And finally, a huge thank you to Amanda Lundberg for captioning me. Great, thank you very much.