Hello and welcome! I am your host Chris Nova and I am here with my good friend Dr. Rachel Vida, and we're going to be doing our presentation today for you. This is called Lives on the Line: Disaster Response from the COVID-19 Pandemic. So we're going to be talking about COVID, medicine, the medical industry, and how that corresponds to disaster recovery, site reliability engineering, and security in cloud-native technologies. So to kick us off, I'll hand it over to... I'm going to call you Rachel, I hope that's okay. Yeah, great. Totally. So without further ado, my good friend Rachel slash Dr. Vida. All right, hello everyone. I'm Dr. Rachel Vida. I'm an internal medicine physician trained at the University of Washington. I am not Kubernetes informed, I am merely Kubernetes adjacent thanks to my spouse Joe Vida. That's how I met the lovely Chris Nova, and this picture you're looking at here is one of the places I have been the most scared in my entire life. It's the Roman Wall below the summit of Mount Baker, and Chris coaxed me up like the mama bear that she is so that I could get to the summit and not die. So today I am excited to talk to you about my medicine perspective on the COVID-19 pandemic and how it relates to computer security. Well, Chris will take that part; I'm going to talk about the medicine part. Awesome. And hi everyone again. I'm Chris Nova. I don't even know what to say right now. Computer science, hacker, author, engineer, troublemaker, professional public speaker when there used to be a public. Now I think we cross out the public and I'm just a professional speaker, or at least I was at the time of recording this. I like to climb mountains, and I'm pumped to be talking about medicine and mountain climbing and computer science and security and site reliability engineering today. So welcome to our presentation. So to kick us off, I'll kind of give an overview of what we're going to be talking about, and Rachel and I will kind of go back and forth a little bit, and I guess kind of just to give a little bit of context: as I was going through the quarantine, Rachel, who is also my doctor, and I were texting a lot, because naturally as everything was happening, every time I coughed, I was like, oh gosh, better text my doctor because I'm sick. And as I started going through a lot of life in 2020 and existing in the pandemic, I started to think about how a lot of the things I was texting Rachel about were very similar to the things I do in my day job, which is site reliability engineering, computer science, software security, and distributed systems. As we started to explore this more and more together, we discovered there was a lot of similarity between the two. And as a way to teach folks about both the pandemic and computer science, and what would otherwise be a pretty scary realm of things, we're going to be drawing some parallels between the two of these and how they're similar and how they're different. So what we have today for you is four main categories that I think we're going to kind of talk through, and we'll elaborate on each one of these as we go. We've broken each of these four main categories down into smaller subcategories, and we're going to kind of take a breadth-first approach at describing this. We'll talk about the outbreak, what that meant.
And in this analogy, the outbreak is going to be some sort of technical event, something happened that we weren't expecting, whether it's a computer system being infected with a virus or some sort of catastrophic production outage. We'll talk about the alarm and what that looks like and the warning signs associated with that. We're going to talk about the response. How does one respond to these types of things? What was it like for Rachel as a doctor and what is it like for you and your cloud native system? We'll talk about the importance of quarantine, what that means from a viral perspective and what that means in the name of Kubernetes. We'll talk about information sharing and, probably more importantly, misinformation sharing. Further, we evaluate our response. We talk about two concepts that are important in security, prevention and detection, which are equally important in the medical field. And then finally we're going to go into the future. What is coming next in the world? And what does that mean for you and me? And how would we potentially recover a system from some sort of catastrophic outage like the ones we'll be exploring in this talk? The whole point of this talk, I think, is to be high level, kind of a 101 introduction to these lower-level computer science terms, and to present them in a familiar and fun way for folks at home. Cool. Let's jump right into it. I'm going to kind of be picking on Rachel here and interviewing her and asking the hard questions, and she'll be able to help us out here a little bit. Let's start off. We have the outbreak. It happened. There's some sort of incident. And let's kind of walk through what this was like for you as a doctor. And you're based out of Seattle, right, Rachel? Yes. I have a medicine clinic on Capitol Hill in Seattle. And that's kind of where everything started here in the United States. Yeah. So we started hearing about a new infectious disease in the Wuhan region of China. It was just after the new year. We started hearing that it was incredibly contagious in that you could spread it before you even had symptoms, which is different from the first go-around with SARS. It seemed to have a higher death rate than a typical respiratory virus. And then a couple of weeks after we started to understand a little bit more about it, there was actually a big holiday in China, Chinese New Year, where families typically get together and travel. And that was sort of the catalyst for a lot more spread very quickly. So we started to hear about this in Seattle. Of course, it's tempting to be isolationist and think, well, that's not here. That won't affect me. But of course, everything is global. Everyone travels all the time. It was pretty naive to think that something this contagious was not going to affect us. Interesting. Do you remember where you were the first time you even heard the word COVID? I mean, I was probably scrolling Twitter, but we started to talk about it in the clinic, and my practice partner and I wanted to be pretty proactive about this. Like, okay, patients are going to call us. We have smart patients. We have informed patients. We have patients that travel a lot. They're going to be calling us. So what is going to be our response? What do we need to do in our clinic? Do we need to worry about it being brought into our clinic? So we started having these conversations pretty early on.
And I remember one of our employees at the time was going to travel for the February school break to a bunch of different Asian countries. And we asked that employee if, when they came back, they would mind working from home for 10 to 12 days to self-quarantine. And that was really controversial at the time. We had a lot of internal strife with this decision. But I remember really early on thinking, this is probably going to affect us. We should probably err on the conservative side and take this seriously. So I remember having those conversations very early in 2020. Awesome, awesome, awesome. So the parallel that we're going to draw here is we're going to start looking at what that would feel like as it relates to some sort of cloud native system. So in this world, when I say cloud native, I mean some sort of computer system running Kubernetes with a handful of applications running on top of it. Those applications can be used to instrument your system, or those applications can be used for you and your business, whether you're running a front-end e-commerce website, or perhaps you're running a doctor's office with a health portal, any sort of user-level application running on top of some underlying infrastructure. And we're going to talk about what it would look like to have the same type of experience of starting to notice that something is going wrong in your system. Where in Rachel's case she was reading about it on Twitter, in our case we would probably start to see something popping up in one of our systems. So this is a good opportunity to share the lesson of observability. And what we mean when we say observability is your ability to understand what's happening at the system layer and, more importantly, to connect all of the different components of the system together. In other words, in this example we have a physical layer, a system layer, and applications, starting from the bottom up. In the physical part of our system, these are things like compute, storage, and network. This is the Linux layer, the physical layer, something you could go and touch. And really what Kubernetes does is a good job at orchestrating this physical layer for us. So observability would be our ability to detect when something in that layer of the system starts to go wrong. This is where we get into tools like Prometheus, and later we're going to get into more advanced runtime security tools like Falco. We move into the system layer, and that's going to be everything from kernel tracing to operating system logs to operating system introspection. And we're going to be able to start to map those together. That's where we get into these high-cardinality observability tools that you see around the ecosystem. And last but not least, we have application instrumentation, which is a fancy way of saying glorified logs in my mind. You're actually putting lines of code into your application that can then be scooped up with tools like Prometheus such that you and your team can know about it. So once you have this observability system in place, what is that actually going to look like for you? The first time you see something like a pandemic breaking out in a computer system, it's probably just going to look like a little blip in a graph. And that little blip might be something like one of your pods in your Kubernetes cluster running out of memory, or perhaps Falco starting to detect system calls being executed that have never been executed before.
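To make that a little more concrete, here is a rough sketch of what a Prometheus alerting rule for that kind of blip could look like. This assumes kube-state-metrics is being scraped; the group name, rule name, window, and labels are just illustrative:

```yaml
# Illustrative alerting rule: fire when a container restarts and
# kube-state-metrics reports its last termination reason as OOMKilled.
groups:
  - name: example-blips
    rules:
      - alert: PodOOMKilled
        expr: |
          increase(kube_pod_container_status_restarts_total[10m]) > 0
          and on (namespace, pod, container)
          kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} was OOMKilled"
```

In practice you would wire this into Alertmanager, but even a rule this small gives that blip in the graph a name that a human can respond to.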
And as you start to observe these, hence the term observability, you're going to start to have the initial warning signs of what could potentially be an outage or some sort of threat moving downstream. So this is a good opportunity to reinforce the fact that you want to have some system of observability in place, so that when something like this starts happening, you and your team are going to be able to respond to it. So let's talk about the alarm. This is the exciting one. This is where the rubber meets the road when something happens. So what was it like for you in Seattle when all of a sudden the alarm got sounded? Yeah, I heard reports that there were two people that had coronavirus in Washington state. We were pretty much the very first hotspot, not the biggest, luckily, but probably the first. And there were two cases. They were both in the county north of here, Snohomish County. So of course, you're like, well, that's not my county. It's not going to affect me. And you try to reassure yourself: well, these people had actually come back from China, they had been traveling, so clearly it's not spreading from person to person in Washington. Like, it's still going to be okay. There's all this rationalization everybody tries to do. But then we had nursing home cases at Life Care Center in Kirkland, which is probably 15 minutes from here. And that was all over the news. I'm sure you probably all remember hearing about it, but it was spreading from person to person in a nursing home here in King County. So you can't ignore it. You can't pretend it's not going to happen. It's here. It's spreading. People started dying. And then very shortly after that, this was in the beginning of March, Governor Jay Inslee closed the schools. Well, Seattle closed the schools; Governor Inslee had a quarantine initiated. And we were all sort of realizing, this is here. This is scary. It's not going to go away. And so rapid intervention was made in Western Washington. Rapid intervention. Oh my gosh, I love it. So relevant. That's exactly what we want to see in sort of a cloud native system, right? If something happens and you have some sort of alarm (in Rachel's case it was, you know, the governor sounding the alarm, closing schools, people dying), you want to be able to do something. So in the case of doing any sort of forensics after an incident in computer science, the first thing you're going to want to be able to do is retell the story to understand what happened. And you're also going to want to be able to make sure that you bring your systems back online if something's offline. So what exactly does that look like? This is where we get into the beauty of the cloud and the ability to take snapshots, and the ability to recreate systems and replicate environments, or perhaps to back up data, such that you could actually go back in time and see what the state of the world was like at the time of the event, so that you can begin to piece it apart, make sense of it, and build it back together. So an alarm can come from anything from a metric violation in Prometheus to, more importantly, Falco and its ability to trigger a runtime alarm when something unexpected happens in your Kubernetes cluster. It'll trigger an alarm, and you can respond to that in the same way that Washington State responded to the coronavirus and asked people to quarantine inside.
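As a rough sketch of what one of those runtime alarms can look like, here is a Falco-style rule. The rule name and output message are made up for illustration, but it leans on the standard Falco fields and default macros like spawned_process and container:

```yaml
# Illustrative Falco rule: raise an alert when an interactive shell is
# spawned inside a container, which we would not expect at runtime.
- rule: Shell Spawned in Container (illustrative)
  desc: Detect a shell starting inside a running container
  condition: spawned_process and container and proc.name in (bash, sh, zsh)
  output: >
    Shell spawned in container (user=%user.name container=%container.name
    image=%container.image.repository command=%proc.cmdline)
  priority: WARNING
  tags: [container, shell, example]
```

When a rule like that fires, the alert is the starting gun for everything that follows: snapshot, isolate, investigate.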
In the case of Falco, if you trigger an alarm, you can then begin to quarantine your nodes in Kubernetes and you can begin to isolate them, such that whatever was infecting one of those nodes is no longer horizontally spreading to other nodes in the cluster, alleviating that horizontal attack vector that you would otherwise see. And this can be anything from outages to some sort of malicious behavior. The point is, you want to be able to respond to it quickly, and in order to respond to it, you have to know about it in the first place. Okay, speaking of response, let's talk about response. One of the beautiful things about being an SRE or working in security is a lot of the time we actually carry a pager. I'm not sure, have you ever been on call like that before? Oh, yeah, definitely. Definitely. I remember, well, you get used to your pager going off at all hours when you're an intern and you have no control over your life, to the point where now if something buzzes on me, I still, this many years later, have a little bit of that, oh no, there's my pager going off, even though I'm not wearing a pager anymore. So yeah. Yeah, yeah, pager duty is, it's like a love-hate relationship, to be honest. Yeah, so, you know, everybody remembers we had to go into quarantine. The big goal with quarantine, of course, is to get the case number down. Because this is a disease that is spread between people when they're in close contact, person to person, primarily, the goal is to have people stop being in close proximity with each other to slow the spread. I mean, there's some fomite spread, where somebody sneezes on a door handle and then you touch the door handle. But honestly, the biggest way that coronavirus is spread is person to person: you inhale particles that somebody coughed, sang, yelled, or spit out of their mouth into the air next to you, and you breathe in the particles. So we want to get the reproduction number down below one. To explain what that means: the R naught, or basic reproduction number, of a disease is how many new cases you get from one infected individual. Initially, we thought that coronavirus had an R naught of about two to three, and in clusters in certain areas it may have been as high as four to five. But we want that value down below one, because if the number of new people infected from a current case is less than one, the disease shrinks; the disease will eventually go away if you keep that R below one. Different diseases have different infectious abilities. Measles is one of the most incredibly infectious diseases. Somebody with measles can cough or sneeze in a room and there are still measles particles circulating around in that room hours and hours later. And so the R naught for measles is 12 to 18. So 12 to 18 new cases of measles will come from one infected person. So there was a lot of talk early on in March of getting the R below one. I don't know if you remember hearing about that, but it's a big goal. We're still not below one. The R crept down to about 1.2, but it's going back up again. It's at about 1.4 to 1.6 right now. So we still have work to do. Awesome. Let's talk about quarantine in the name of Kubernetes, right?
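Before we dig in, here is one minimal sketch of what quarantine can literally look like in Kubernetes terms: a deny-all NetworkPolicy applied to the workloads you suspect are compromised. The namespace and label here are hypothetical; the idea is simply to cut the suspect pods off from the rest of the cluster at the network layer:

```yaml
# Hypothetical quarantine policy: any pod labeled quarantine=true in the
# production namespace is denied all ingress and egress traffic.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: quarantine-suspect-pods
  namespace: production
spec:
  podSelector:
    matchLabels:
      quarantine: "true"
  policyTypes:
    - Ingress
    - Egress
  # No ingress or egress rules are listed, so all traffic to and from the
  # selected pods is dropped (assuming a CNI plugin that enforces NetworkPolicy).
```

Node-level quarantine, cordoning and draining a machine you no longer trust, works the same way one level down, and that's where tools like Cluster API come in, which we'll get to in a moment.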
So if we look at traditional computer viruses and how they relate to Kubernetes, right away the whole point of Kubernetes is that it gives you a set of abstractions that make it easy and convenient for you to access other pieces of your infrastructure in the same cluster, which from an attacker's perspective is fascinating, because theoretically, once you compromise the auth material, you not only have access to other nodes in the cluster, but you have a wonderful set of tools that people put a lot of time into making it convenient for you to access other nodes in the cluster as well. So you see that same type of telescoping nature and the same type of curves that you would see in an infection spreading, but in the form of a horizontal attack vector in Kubernetes. And there's a handful of those. If you're interested in getting involved, please reach out. We can connect you with folks in security and talk about how you can get involved with not only understanding the different attack vectors and the potential CVEs, but how you can prevent them moving forward. We also look at failure rates and how you can respond to them with things like Cluster API. Perhaps you notice that you have a handful of nodes, some of them have been compromised, and you want to grab some snapshots of them and take them down. Cluster API would be able to give you a set of tools to mitigate that and to mutate those nodes at runtime. You could plug that into the rest of your backup system, and that would actually be able to take those nodes down, drain them, and get whatever infection was inside of your Kubernetes cluster offline, such that you could bring up new immutable infrastructure alongside of it. And we're back, and my camera died. But yes, network isolation is going to be a big part of quarantine. If you can isolate the infection or the malicious user at the network layer, you're going to drastically diminish their ability to communicate with other pieces of infrastructure inside of your computer system. So again, we see that quarantine is important here, but more important than that is the ability to take action and to understand that once you have one potential compromise, you can see it spread to other parts of your infrastructure as well. Oh, this is a good one. All right. So with the coronavirus, there was a lot of information coming out really fast, and there was some misinformation. Some of the early information was well-meaning, and we thought it was probably good information, but it was sort of rushed out as quickly as possible without being peer reviewed or really confirmed. So first, let's start with the good. The good stuff was the Chinese government and the Chinese medical establishment worked incredibly quickly. The first big government report of the coronavirus being an actual disease and an issue was about December 31st or January 1st. They had the coronavirus genome sequenced by January 12th. That is mind-blowingly fast. And then they open sourced it. They put out the coronavirus genome globally and said, look, here's what we found, and that allowed everybody to get started on potential vaccines that much faster. So that was awesome. PPE protocols. We got these lovely PDF downloads of exactly how to don and doff your PPE, what PPE was required, how to have teams, how to have an observer. They also had some more somber things about how to take care of patients once they passed away, how to prevent spreading the virus from a deceased patient.
All of this was open sourced and passed around the medical community around the world, and it undoubtedly saved a lot of lives, because our Chinese colleagues in medicine wanted to get this information out as fast as possible. There was a lot of misinformation. At first, there was a lot of talk about, don't worry, just make it till summer when it's warm, it's just going to go away. Well, that didn't happen. There were small observational studies. There was one from France that touted that hydroxychloroquine plus or minus azithromycin would decrease mortality rates from the coronavirus. That was put out in a journal, but it wasn't peer reviewed. It was kind of a nightmare, because once it actually got peer reviewed in, for example, the British Medical Journal, which I have a link to here on the slide, they couldn't replicate it. And subsequent studies couldn't replicate it. And it's unfortunate, because patients who actually need hydroxychloroquine to treat certain autoimmune diseases then couldn't find their absolutely critical medication, because people had started stockpiling hydroxychloroquine when it didn't even work for COVID. There was a lot of other misinformation about different medications. There was the week where we were all too afraid to take NSAIDs. There was the week where hypertensive patients were afraid to take their blood pressure medicine because of ACE inhibitor interaction with COVID. There was misinformation about how it gets transmitted, about whether, if your delivery person delivers you an Amazon package and then you touch the package, you are going to die of coronavirus. All sorts of misinformation on the internet, and so it was hard to know what to believe, what was going to change a week from now, what kind of mask works and what kind of mask doesn't. And then there were people saying, don't worry, it's just like the flu, only people who have a lot of medical conditions will actually die from it, which was both a horrible thing to say and also turned out to be wrong. So the whole thing wound up with everybody in a state of panic initially. Yeah. And as we'll see, if anybody here has ever been in a disaster response situation where production has gone offline or something has been compromised, it's the exact same situation. There's a state of panic, people start to assign blame, and you go into this chaotic state of misinformation and speculation. So let's talk more about that. The first thing Rachel mentioned was the importance of open sourcing information. We see this in open source throughout the ecosystem, and you also see this in security. If you do happen to discover a vulnerability, there is such a thing known as responsible disclosure, which allows you to first give somebody an opportunity to fix it. And then once it's fixed, you're able to roll out some sort of patch to the rest of the ecosystem as well. But again, none of this would be possible without the concept of open source and sharing, and realizing that sharing and coming together as a community is actually more important than trying to keep this to yourself for your own use case. So a huge lesson in humanity here, and the importance of sharing information as needed. After this, we look into postmortem evaluation. So this is the process, the ceremony, in which you'll notice that a lot of... Anyway, we were talking about postmortem evaluation and what that means.
And like I had mentioned, it's the exercise of getting together after some sort of event and trying to understand what happened. It's this process of going through it with you and your team. Typically, you're not assigning blame; it's not a blameful exercise. You're just trying to discover what happened. In the same way, we see this with COVID: how did it get here? How did we respond to it? And what do we do to prevent it from happening moving forward? And there's other misinformation you might see as you're going through a system: be aware of things like false positives and red herrings. These are things that can come up where it may look like you're detecting the problem, or it may look like you're actually understanding what's going on. But if anybody here has ever debugged a live production system before, the truth is typically much, much more sinister and usually involves DNS. Anyway, we're looking at correlation versus causation. Of course, everyone knows that correlation doesn't necessarily imply causation. And last but not least, we have speculation, which can be a good thing if you and your team are speculating about perhaps some sort of attack vector, or some sort of reason for your systems going down, or some reason why your pods are getting killed. But it isn't always necessarily the right thing, and sometimes it's easy to get fooled by speculation into thinking that you've discovered the cause, and that can actually end up causing more problems downstream. So the main takeaway here from a site reliability engineering perspective is just to be skeptical: instead of trying to prove things right, try to prove them wrong. It's a good rule of thumb that I like to follow. So moving on, two really important concepts here that I have already mentioned once: prevention and detection. And the differences between the two, and how they're both relevant for the medical industry as well as computer science. One way I like to look at this is security policy for humans. I like that. Nice. All right. Well, so we're all pretty good about knowing what we need to do to prevent COVID-19 at this point. I mean, you've all heard it all over the news, social media, everywhere. So masks, wash your hands, social distancing. But let me go back to masks. Initially we thought that you had to have, like, you know, a full moon suit, hazmat, N95, everything to prevent it. And then we realized, no, it's spread by droplets, so, you know, cute little cloth masks that your auntie makes you are probably good enough. And then, oh, but no, maybe it is actually aerosolized and airborne. And so now we've gone back to healthcare providers and anybody that is dealing with somebody who potentially has COVID or is known to have COVID needing to go back to using N95s. But basically, masks work, masks decrease the spread. We now know that it's respiratory spread far more than, you know, by sneezing on a doorknob, for example. So, you know, wear a mask. We also know, I mean, washing your hands, you know, like wash your hands. We've all known this since we were little kids, just wash your hands. And then social distancing. So we stay six feet apart from each other, but you know, is it exactly six feet? No, it's somewhere around six feet. Further is better. Are you outside? Are there air currents to disperse the droplets faster? Are you somewhere where they have a really good HEPA filter?
And also, you really want to decrease the number of particles you inhale. That's another reason why we think that cloth masks are helpful, even if they don't keep out, you know, 100% of COVID particles, because there seems to be a correlation between the number of particles you inhale and how sick you get. So it's thought that people who inhale a whole lot of COVID virions get a lot sicker than people who may have only inhaled a few. Again, we're still learning about this. We're still learning; you know, monthly there's new information coming out. So just because we said one thing a few months ago and now we're saying another, I mean, that's science. That's what's good about science. With more evidence, you have new information to go on. So it's still evolving. And unfortunately, unlike other diseases, we don't have a preventive medication yet. We don't have a vaccine yet. We don't have a way to really prevent it from that standpoint. So we're still stuck with wear a mask, wash your hands, and keep social distancing. Let's talk about prevention in Kubernetes and cloud native. It's the exact same thing. It's a list of things you and your team can do to try to prevent it from happening to you. I mean, that's exactly what we're doing here, where we're coming up with tools and techniques to try and prevent unwanted behavior, unwanted actions, from taking place. And the stronger our prevention technology is, the stronger our prevention policy is, the healthier our system is going to be. I mean, it's literally a one-to-one relationship, just like Rachel was talking about. So the first thing is role-based access control. Access control: who can do what? Can this group of users access this system? Can this group of users access that system? And how well are these systems defined? How many parts of the system are actually lumped into this access control? And that's where we get into things like admission controllers, which allow you to control what is and is not admitted into your Kubernetes cluster. Furthermore, we can take that a step further and look at things like security policy. This is everything from pod security policy to Open Policy Agent, OPA, and concrete implementations like Gatekeeper. So we're starting to look at how we're expressing what people can and can't do, what users can and can't do, what clients can and can't do inside of our Kubernetes cluster. And the more rigorous our security policy is at preventing these things from happening, the less likely it is that our cluster is going to get compromised or that we're going to get sick. So yes, these things are very much the equivalent of masks. They're not foolproof. For instance, if you look at RBAC: in an RBAC cluster where you only have access to a given namespace, if you can run a privileged container, you can still escalate down to the node, compromise the cluster, and escape that namespace. But it's still effective against people who don't know about that attack vector, a well-known attack vector that's been out for years. So again, we're in the situation where we're just trying to strengthen our preventative techniques in order to minimize the amount of potential compromise we see downstream. We go and take that further with things like regression testing. How do we prove and disprove that things that happened in our cluster that we didn't want to happen are no longer happening? And we take that a step further with things like kernel controls.
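As a rough sketch of what that can look like at the pod level, assuming the runtime's default seccomp profile is the allowlist you want, you can pin a workload to it and drop the capabilities it doesn't need. The pod name and image here are placeholders:

```yaml
# Illustrative hardened pod: restrict the container to the runtime's default
# seccomp profile and drop all Linux capabilities, so unexpected system calls
# and privilege escalation are blocked at the kernel boundary.
apiVersion: v1
kind: Pod
metadata:
  name: hardened-example
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
```

Security policy tooling like the admission controllers we just mentioned is what keeps settings like these from quietly disappearing.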
How do we control which processes and which users can and can't execute which system calls? At the end of the day, all applications need to translate to the kernel somehow, and that's where things like glibc come into play, which ultimately translate directly into system calls. And if we're able to control which users can and can't execute various system calls, we're able to control which parts of our cluster and which parts of our system can and can't get sick. So we see this all throughout the ecosystem, and it's absolutely critical for any type of security policy. It's just like washing your hands. You've got to wash your hands. You've got to turn on RBAC. Okay, next one. Go for it. Testing, detection. How do we figure out who has COVID early to prevent more spreading? So in medicine, every test has sensitivity and specificity. Sensitivity is, if somebody has a disease and you test them for it, do you actually pick it up? And specificity is, if you test somebody and your test comes back positive, do they actually have the disease, or is it a false positive? So it's not enough for a test to be really sensitive but not specific, or vice versa. It has to be sensitive and specific to really be able to trust your results. Right now we have rapid testing, like point-of-care testing: all these sports teams, politicians, they keep getting tested frequently and rapidly. It hasn't really been rolled out to the general population yet, unfortunately. And then we have PCR-based lab confirmatory testing. That's the swab that everybody knows about that goes way up, and the joke is that it's a brain biopsy because it goes so far back. It has to go to the lab. It takes hours to come back; functionally, you generally get your result in a day or two. I mean, that's not really good enough to be able to quell the spread. So if we want rapid testing, the problem is that rapid testing is not as sensitive as lab confirmatory PCR testing. The question is, is that okay, though? It turns out there's an infectious disease expert, Paul Sax, who is strongly pushing for rapid testing: everybody tested as frequently as they want, widespread testing, even if it's not as sensitive. Because if you can test more people quickly, cheaply, and repeatedly, you're going to do a better job with lower sensitivity than you would with a very high sensitivity, high specificity test that you can only do on a limited basis and that takes days to come back. Beautiful. And I feel like I don't even need to say anything after you get done talking, because these analogies are just writing themselves. We see this exact same thing in Kubernetes. You want to be able to check the false positives, and in some cases, as you're debugging your systems from a site reliability engineering perspective, it's just about getting things out there. If I have a thousand nodes in my cluster, I would rather quickly know that roughly 90% of them are up and running and healthy than have no idea which of them are healthy at all, even if that 90% comes with a small amount of uncertainty, right? So that's where we're getting into the computer science terms, like having a false positive, where something looks like it's not working when in actuality it is.
So the lesson here is quantity over quality when it comes to patches, and these patches can be everything from software patches to actual system infrastructure patches, but we're trying to get things out there to get our systems back online and back as healthy as possible as quickly as possible. Furthermore, we look at other detection primitives like Falco. I mean, we take it down to the system call layer, and if you want to come in and see what's going on at the kernel, Falco is going to be the number one way for you to take that information and pull it up into user space so that you can start detecting runtime security anomalies. And there's already a complete rule set defined for you so that you can actually understand what's happening out there. For testing, look no further than regression testing. You can write a regression test for more than just software. If something goes into your system and violates the network, or violates the kernel, or violates the infrastructure, or violates your storage, you can write a regression test for that if you have a good infrastructure testing suite set up and running. So again, it's just about knowing what's going on. Once you know, you're able to make decisions. We all remember what it was like at the beginning of quarantine, when nobody knew what was going on, and that was a very stressful time to actually make any concrete plans for our families because we didn't know what was going on. So in computer science, please, please, please understand the value of simply being able to say whether something is true or false. All right, let's talk about the future. Let's talk about the vaccine. Or as I like to put it: oh my God, somebody finally patched OpenSSL. If anybody here has ever followed the OpenSSL CVEs over the past few years, this is quite an exciting moment for the internet when something like this happens. The vaccine, of course, everybody is hoping it will come out, like, yesterday. What we're going to need, though, once we have a vaccine, is a vaccine that's going to achieve 70% immunity. Where that number comes from is the concept of herd immunity. We need the vaccine to be deployed widely enough that everybody can get it, or at least about 94% of us can get it, and then we need that vaccine to work on as many people as possible. So let's take an example of a vaccine that's been out forever and that we know about, which is the pertussis vaccine, the whooping cough vaccine. We all get it as kids, and we get updates several times throughout our lives. It's commonly called the Tdap; it comes combined with the tetanus vaccine. It turns out there's a certain percentage of the community that is never going to mount a response, an immune response, to whooping cough, no matter how many times you give them a vaccine. So what we have to do is give the vaccine to at least 94% of the population, that's the target, so that for those people who never mount a response, it doesn't really matter; there aren't enough targets for the disease to keep spreading throughout the community. So once this vaccine comes out, we still aren't going to be back to life as normal until enough of us get vaccinated that 70% of the folks in the world really are immune. That means either 70% of people get vaccinated and it works 100% of the time, or 100% of the people get vaccinated and it works 70% of the time. Vaccine trials come in three phases. Phases one and two are basically to prove that, number one, the vaccine actually does what we think it does on a very small scale.
And number two, that it is safe. Like, we don't give you the vaccine and then you get some horrible autoimmune disease or, in an extreme example, die from it. So there are multiple vaccine candidates that have gone through phase one and phase two, and we have several now that are into phase three, which is where we start vaccinating thousands of people, volunteers of course, thousands of people. Well, in the U.S. they're volunteers; in other countries they've taken different approaches where people aren't given as much choice about whether they get the vaccine or not. But in the U.S., it's volunteers. And then we have to wait long enough to see whether these volunteers actually develop the disease at the same rate as other people who are not vaccinated, or whether they don't develop the disease and the vaccine works. Once phase three is completed, then we roll it out to the general population. That's going to take a while. I'm not in the practice of predicting the future, so I wish I could tell you what day or what month it's going to happen. I can't, but things are moving along, and they're moving along a lot more quickly than they have in the past for any other vaccine that we've tried to make. All right, so let's talk about patching production, right? This is the analogous situation: how do we cure production if something goes wrong? So let's say that there was a malicious user who violated our system, came into our production cluster, and started doing something. Let's say they were mining for Bitcoin, and we wanted to go and fix that. Well, the first thing we're going to do is stop it from spreading any further. The second thing we have to do is get rid of the user, and we probably want to understand how that happened, which was the point of the postmortem exercise. And now that we understand that, we are ready to update our security policy and update our security posture such that we would prevent this from ever happening again. So this is a way of effectively introducing a new change into our system that will prevent this malicious user from coming back in, taking some sort of action, and causing harm later. How we do that is we typically want to deploy something, whether it's a patch to production, a new container image, or a new artifact for our application. Perhaps it has nothing to do with our application at all; it's part of the software we're running on top of. In some cases it could be part of Kubernetes itself; in some cases it could be part of Linux itself. Regardless, we want to make sure that we are able to prove beyond the shadow of a doubt that this is in fact what we want to be deploying once we actually deploy to production. And that's where image provenance and artifact provenance come into play: proving that it is what we think it is, making sure that we're not somehow compromised as we're trying to inject the new fix into our system. We look at concepts like A/B testing, which allow you to have the older, vulnerable set of infrastructure out there and then flip back and forth between the recently patched system and the legacy system to see if there are any differences, to see if anything has gone wrong with the new change. That allows you to do things like slowly dial into the new system, so that you're not just doing a cold cutover and putting every piece of infrastructure in your cluster at risk. And that's where partial patching comes into play.
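One small sketch of that, using nothing but a core Deployment: the rolling update strategy below replaces pods a quarter at a time rather than all at once, which is the simplest built-in form of partial patching. The workload name, replica count, and image are hypothetical:

```yaml
# Illustrative rolling update: never take down more than 25% of the replicas
# at once, and only surge 25% extra pods while the patched image rolls out.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
spec:
  replicas: 8
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      containers:
        - name: app
          image: registry.example.com/app:1.1   # the freshly patched artifact
```

Finer-grained, step-wise canaries, where you dial a percentage of traffic up over time, generally need dedicated rollout tooling on top of this, but the principle is the same.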
You can start off with 10%, move up to 25, 50, 75, 80, 90, and then finally the whole thing is cut over. And again, a perfect example of this is patching SSL, right? There's a vulnerability, it sits in virtually every piece of software out there, and then how do we go through and begin to fix that across the board? That could be a tedious process. So making sure that you have a good story in place for how you're going to go and patch this is going to be critical. And even if you're not fixing anything yet, having that story in place in the first place is going to be a critical part of your stack as you begin to look at adopting and using Kubernetes. All right. So for any disaster, you have to have a response plan in place. And for COVID-19, that's really been super challenging this year, as I'm sure everybody is fully aware. Although I guess it depends on which part of the world you live in. But in the United States, there was a pandemic response working group that was proposed back in 2016. It involved scientists. It was a working group to make a plan, because it was apparent, based on our history, that there was going to be a pandemic again at some point. We already knew about Ebola. We knew about SARS the first time around. We knew about the global flu pandemic back in 1918-1919. So in 2016, the government had a robust plan. It unfortunately was disregarded and not put into place in the United States when it was actually needed. But previously, there had been a plan. And this is a good example: you need to have a plan. You can't just wait until something bad happens and then scramble retroactively. You should always be proactive about this sort of stuff. We should have strategic stockpiles of PPE. Apparently, the United States had the stockpile. I can't speak for other countries, but the US had a stockpile. It did not get deployed efficiently. We also have a US stockpile of critical medications, including antibiotics, for example antibiotics against anthrax, airway and breathing medications, things that are absolutely critical. We do have a national stockpile of that. And then you have to have a deployable chain of command that is identified in advance and then able to be rapidly deployed. A phrase I used earlier: rapid deployment of pre-planned scenarios, not waiting to scramble retroactively once a disaster happens. Beautiful. Absolutely couldn't agree more. This is the first thing I tell people when I walk into an SRE team, especially a new SRE team: what are you going to do when something breaks? What are you going to do when it hits the fan? That's a real conversation that a lot of people aren't prepared for. Or, even more so, there's this false hope that we are prepared for it, but nobody actually knows where the manual is. We see that a lot in disaster recovery, because it's never going to happen to you, of course, until it does. Being prepared is a huge lesson anytime you're doing any sort of resilient distributed systems engineering. So for the first steps to recovering a system, you have to roll back, you have to revert. If you are not ready, at the push of a button, to go back to the last release, to go back to a known working state, to just get the system back online so that you can stop and breathe for a moment and figure out what's going on and what you want to do next, you're going to have a really rough time when disaster does inevitably strike. We also see chain of command.
Somebody's on call and they get a notification. That notification is escalated to an alert, that alert to a critical alert, and the next thing you know your production system is offline, you're compromised, something's going on. And what's next? Who do you call? What button do you push? What email do you send? Where do you go? There's nothing worse than getting a page at three o'clock in the morning and feeling completely isolated, all alone, without knowing what to do next and trying to fix the entire thing by yourself. So having a chain of command, knowing the next steps, and it being second nature so that you don't even have to think about it is going to be critical to any sort of healthy response plan. Documentation. Where do I go? What do I do? Which book do I pick up? Which page do I turn to? These are all things that are going to be critical as you and your team are actually in the moment, at runtime, recovering from some sort of incident. And last but not least, regressions. A regression test is just a fancy way of saying I want to prevent this from happening again. So in computer science we write regression tests. These are ways of proving that you and your system are not going to be vulnerable or liable to the same thing that you were in the past. So not only understanding the regression, but then later testing against it to make sure it's not happening again, is going to be absolutely critical for you and your team. So again, let's do a quick overview. We looked at the outbreak and the warning signs, and we looked at what it was like the day we got the first alarm, especially with the pandemic here in Washington State. We looked at responding to that and the importance of quarantining known infections or known vulnerabilities. We talked about the problems and the good things with information sharing and misinformation sharing. For further response, we talked about the importance of prevention and detection. And we took a look at the future: what's next? What's the disaster recovery plan, and what does the vaccine look like? Rachel, is there anything else you wanted to add before we wrap up? No, other than this has been really fun, and, you know, computer science is not my background, but I feel like I learned a lot, and I appreciate you letting me help you with this. Yeah, absolutely. Thanks for being here. So thanks for your time, everyone, and we'll let you get back to the rest of your computer. Thanks again.