All right. Can everybody hear? Wonderful. Okay. So this is our talk on preventing Doomsday. It's going to be all about X.509 certificates and Cloud Foundry, and how to make sure that you don't experience a Doomsday. So I guess we should just start off by introducing ourselves. My name is David Daubmeier. I am a Cloud Engineer. My name is Tom Mitchell. I'm also a Cloud Engineer. So I guess we should jump straight into it. Why do we use X.509 certificates? Well, we use them for a couple of reasons. One of the reasons is authentication. A cert proves that a server is who it says it is, and for security, that's one of the key components. The other thing is encryption. That's what most people use these for. You'll see mutual TLS for the encryption capabilities, and within Cloud Foundry, certs are used for all of the internal communication. So you want to ensure that all your traffic's encrypted. We do that with certs. So the logical thing with certs is you create them, and at some point, they expire. Why do these certs expire? Why don't we just set them to 100 years and forget about it? By the time they expire, it's the next person's problem. Make it five years, make it 10 years. Make it not my problem. Well, the reason for it is kind of interesting. One of the logical reasons is to allow certificate revocation lists to shrink over time, because otherwise they'll just keep growing and growing and growing, and your CAs will have a bit more trouble. The other reason that I like to give is that it allows new technologies to eventually be rolled out. You don't want to be using certs that are signed with SHA-1. Let's get those out of here, and having a one-year or two-year expiry allows you to eventually get the replacements rolled in. The remainder of the reasons really fall into when certs should be revoked, but don't end up on a revocation list. They should expire eventually, so that you don't just have bad certificates out in the wild forever.
Yeah, absolutely. And one of the reasons for that is, say you have turnover in your team, or you have a cert that gets leaked. You don't want that to be around forever, especially if you don't know about it. The more often you turn over your certs, the less chance there is for a leaked cert that hasn't expired to be used maliciously. So this kind of brings us to: what if certs expire? What if we just forget about it? What if we just say, oh, that's not really that big of a deal? Well, logically, things break, and they break in really interesting ways, and typically in ways that you don't expect. Simply because it's on a timer and not caused by an action of your own doing, it can feel bad to do nothing to the Cloud Foundry for the past week, and then wake up the next morning, come in to work, try to log in, and you can't push apps, you can't do anything. And then you look to the engineers around you, and you say, did you touch anything? No, I didn't touch anything. Did you touch something? Chuck, yeah, who did something? Somebody did something. I had it in a cell. Did I break it? I had it in a cell. So the first way that you've probably experienced expired certs is with this screen here, in some form or another. You get "your connection is not private." Your users might get this, your customers might get this, and it's scary for a lot of people, and it should be. That's the point. But in the grand scheme of things with Cloud Foundry, this is one of the best outcomes. If you hit this, it more or less means the certs on your front-end load balancers have expired. You throw a new cert on there, and you're kinda happy. Things start flowing, things are getting trusted. That's the easy fix. Right, your backend is working, but no one wants to use it, which I guess is ironic and sad, but that kinda brings us to the other point.
While these are the front-end certs that most of your customers will see when they go to your site and they get the little green lock, Cloud Foundry has a lot of certs, and more than just front-end-facing ones. A couple of years ago, the CF team added mutual TLS to what was at the time cf-release, for pretty much all of the components that talk to one another. Any component that talks to another component has a cert and expects the other end to have a cert that it trusts. And if any of those break, you're just severing that tie and they won't talk to each other. And it can break in a whole lot of interesting ways depending on which cert expires. Yeah, so you'll notice when certs expire, they may not all go at once. They'll all go within a very short period of time because they were all generated when you deployed the foundry. But if one cert expires and this component doesn't talk to that component, it may not be a big deal immediately, and you may not see the effects as a message that says, hey, this is a cert that expired. You might see an issue where it can't talk to BBS. This component just decided not to, or you can't push an app, or you're just getting weird timeouts and things like that. It's not always apparent what's happening. And that's a scary thing. So in that case, something's gone wrong. You've gotten a call and you're trying to figure out what's going on. And the first thing that you're going to do in a lot of cases is you're gonna go to BOSH. Say, hey BOSH, what is going on? And in some cases, you'll see BOSH say that everything is failing. Everything is on fire. Nothing can talk to anything else. All of the processes just died. Everything's bad. This is not always the case. You may log in and some things are just fine. Everything shows as running. Some components of Cloud Foundry are sort of bad at turning off when they're broken. And so they'll just keep running.
BOSH will say, yeah, everything's fine and running. And for some reason you can't route to your app and you have no idea why. But the thing here is, this is assuming that you can get into BOSH. We use BOSH to deploy Cloud Foundry. That means BOSH came around hours, days, maybe a week before you deployed your foundation. Well, BOSH also has certs. And BOSH has some really important ones. So you try to get onto the BOSH director and sometimes you're met with this. Hey, your cert's expired. I'm sorry, you can't talk to me anymore. You have no option to skip that. You can't say, this is an emergency, let me in. No, you can't do it. I'm sorry, I'm not letting you in. So your only point of recourse here is to redeploy your BOSH director. That is the first step in this process to get yourself back online into a healthy state. So now you're in a state where your Cloud Foundry's down. You're trying to get onto the VMs of your Cloud Foundry to figure out what's wrong. You can't get into your BOSH to do that. And oftentimes, if the CA cert is the thing that expires for BOSH, BOSH will stop trusting its VMs and it will start to try to recreate them. What you'll notice, when you end up redeploying BOSH and you end up getting in, is you're like, okay, I'm back. I can now see what's going on. Show me my Cloud Foundry deployment. And it'll look like this. All of your agents are unresponsive. Well, that's because none of the agents are reporting in to the BOSH director, because either the cert changed, which it did when you redeployed the director in order to get in, because you needed new certs to replace the expired ones, or the certs expired and the agents can't talk anymore. So at that point, your agents aren't reporting any of their health, so they go unresponsive. And BOSH has this wonderful thing called the Resurrector, and we use it a lot to save us: when a machine just goes down, BOSH will rebuild it.
But when you lose things in this sort of scenario, BOSH will try and it will try, and then it will realize something is very wrong here. Once about half of the VMs in the BOSH director are marked as unresponsive, it'll go into meltdown mode and it'll stop trying. So basically what'll happen is some percentage of your VMs will just disappear, and then BOSH will say, something's broken, I'm just gonna stop. And now those VMs are just missing. Basically, you don't wanna let the certs expire because it's gonna lead to a large amount of downtime. Your Cloud Foundry is down because it's broken, and you can't get into the Cloud Foundry to fix it because BOSH is down. You have an outage, and that's not good, and this can be a rather long outage because rotating all of these VMs and certs can take a long period of time. You really wanna try and catch this before they expire because then you're just doing your HA whatever and everything's gonna be fine. Yeah, at one point I had a scenario where I was on an engagement, I got this call, and you always get the call, and it's, we can't get in, we think it's something to do with certs. And at that point, your heart kinda drops and you go, which environment is it? And of course it's always production. So at this point prod's down and you have to do your cert rotations, and cert rotations take time, because in order to rotate these critical certs like the NATS one with BOSH, you have to rebuild those VMs and you have to recreate them, and that takes a lot of time. If you have a lot of cells, you're building a lot of VMs. And during this crisis, I've had an experience where you go to rebuild it and then you get an IaaS error that says, we're out of resources in this region. I'm sorry, we can't give you those VMs back. I know they're gone, but we can't give you them back. And then you're really in for an interesting time. You're trying to do config changes on the fly to get that done and move to a separate region.
And that's not something you wanna do in sort of a panicked situation. You want to catch this ahead of time, you want to do your preventative maintenance and rotate your certs. Is that always going to be the easiest process? No, but it's a way that you can avoid having an experience like this, where you have a very long outage and a whole bunch of confusion and nobody's happy. There's also a very unfortunate thing that happens when you have to recreate all of your VMs, which is that there's something wrong in everyone's environment and you will discover it when you try to recreate every VM in it. This is when the problems come back. This is your doomsday scenario. So how do we mitigate this? How do we remember to rotate these certs? I've seen tons of different ways. How do we do it? Well, the easiest way is you deploy a foundry and you put a reminder in your calendar for next year that says, hey, it's 9 a.m. on April 1st. Happy April Fool's, everybody. Did you remember to rotate those certs? Oh, I was on vacation. I was at the summit thing that I was prepping for. I missed it. I had seven meetings today and I just overlooked that reminder. It didn't seem important at the time, whatever. Right. Meetings get lost. You need a more robust system than just that. And that's just one reminder for one foundry. So what a lot of places resort to, if you use- Speaking of our robust systems, yeah. Yeah. Is like a Google Sheets spreadsheet. Yeah. You'll notice the name that just says, certs that can't expire. There's one person who can open this sheet. He's your middle manager. Yep. And he probably opens it when you tell him to update the certs, which is after they've expired. He will ask you why you didn't look at it when they do expire though. Yes, he will. So we need something better. We need a better solution. And that's really what we've come to.
You need some form of notification center, some form of dashboard that will also nag you and make you rotate those certs ahead of time. And that's where we start to get into Doomsday. Doomsday is a project that I started probably about a year and a half ago now. And like all projects that I'm passionate about, it started out of anger and frustration, because I was sick of fixing these broken environments. And so I said, hey, how about we actually write something that monitors your cert expiration and gives you an easy sort of single pane of glass that shows you what the current state of things is and when you have to get things renewed by. So we have a web dashboard. It's in progress, but that is what it looks like as of a week ago. We wanted to convey as much information as possible in a very concise thing that you can just have on the wall, and the colors change. Things aren't green all the time, but they should be. If you see yellow, you've got a couple of weeks. If they get to orange, you've got problems. You need to get these out fast. If it's black and red like you see at the top there, it's too late. They're gone. You messed up. That's it. This is your Doomsday. It would be ironic to put a nice big clock at the top that counts down in a comical fashion. We might consider it. We've got a lot of terrible ideas that I have to reject. Yeah. And just scrolling down the page, you can see the color gradient moves as you're getting closer to your safe zone, which is, we like to be somewhere around a month to two months out. Get your certs done, rotate them early. That way if you've got some issues, it's not a big deal. So if the web UI isn't for you, let's say you're very CLI-focused and that's kind of your shtick. Well, we've got a CLI and a dashboard there for you too. So I had made the CLI far before the web UI, just as a "this is how you use an API-based thing."
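As a rough illustration of the color scheme described for the dashboard, here is a minimal Python sketch that maps a certificate's expiry date to a color bucket. This is not Doomsday's actual code. The function name `severity` and the exact cut-offs (7 and 30 days) are assumptions chosen to roughly match "yellow means a couple of weeks, orange means problems, red means it's already too late."

```python
from datetime import datetime, timedelta, timezone

# Hypothetical cut-offs; the talk only gives rough guidance on the colors.
ORANGE_WINDOW = timedelta(days=7)
YELLOW_WINDOW = timedelta(days=30)


def severity(not_after, now):
    """Return a color bucket for a cert that expires at `not_after`."""
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "expired"   # the black-and-red rows: already too late
    if remaining <= ORANGE_WINDOW:
        return "orange"    # rotate immediately
    if remaining <= YELLOW_WINDOW:
        return "yellow"    # schedule the rotation now
    return "green"         # inside the safe zone
```

A call such as `severity(expiry, datetime.now(timezone.utc))` would give the bucket for one cert. Sorting certs by expiry before bucketing gives the gradient effect mentioned in the talk.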
What one of the folks in our company ended up doing, which I thought was pretty clever, was he just threw the dashboard command in his .bashrc. It doesn't print out anything if there's nothing expiring soon. So basically if something's expiring soon, when you log into your shell, which you probably do most mornings, it says, okay, you have this stuff expiring soon. You should probably take a look at it. So I guess at this point, you're probably wondering, well, how does it know my certs are expiring? How does this magical thing just all of a sudden know I have all of these certs? Maybe you don't even know how many certs you have in Cloud Foundry, in your Vault, or where your certs are. That's an important thing to know. So getting into some actual configuration, we have a couple of ways of doing this. The easiest one is just the TLS client. You give it a list of a whole bunch of domains and say, go check these for me, and it'll go out, it'll pull all the information, it'll add it to the dashboard. So that's one way you can monitor your front-end load balancer certs that get provisioned from your routing team three months in advance and have the wrong SANs, and you have to have them reissued. That's how you monitor those kinds of things that are publicly facing that you may not have direct control over. The second one is HashiCorp Vault. That's one great secret storage location. You have a bunch of certs in there if you're using that to manage your Cloud Foundry, or maybe not even Cloud Foundry. Maybe you're using it for CF and a bunch of other deployments. It will comb through your Vault, it will look for certs, it will find them, it will add them to the dashboard. Can I take these ones? Sure. These are the Pivotal-approved things. This is Ops Manager and CredHub. So it can also scrape your Ops Manager's credential store. If you've ever gone to the Credentials tab of an Ops Manager installation, it can look for certificates in there.
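To make the TLS client idea above concrete, here is a minimal Python sketch of checking one endpoint's certificate expiry using only the standard library. It is an illustration of the technique, not how Doomsday itself is implemented, and the function names `parse_not_after`, `fetch_not_after`, and `days_left` are made up for this example.

```python
import socket
import ssl
from datetime import datetime, timezone


def parse_not_after(not_after):
    """Convert the notAfter string from getpeercert() to an aware UTC datetime."""
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)


def fetch_not_after(host, port=443, timeout=5.0):
    """Connect to host:port and return the expiry of the presented leaf cert.

    The default context verifies the chain, so an already expired cert raises
    ssl.SSLCertVerificationError instead of returning a date. In this sketch
    that failure is itself the signal that something needs attention.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return parse_not_after(tls.getpeercert()["notAfter"])


def days_left(not_after, now=None):
    """Whole days remaining until expiry (negative once the cert has expired)."""
    now = now or datetime.now(timezone.utc)
    return (not_after - now).days
```

Looping `fetch_not_after` over a list of domains and printing `days_left` for each gives a crude version of the "go check these for me" behavior. Endpoints signed by an internal CA would need that CA loaded into the context first.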
And also CredHub, which is another key-value store, which many of you are probably familiar with. It just walks the tree, and anything that looks like a cert, it will track it and keep track of when it's going to expire. So that's the thing that we wanted to cover. We wanted to try to cover all of our bases and make sure that no matter what flavor of deployment you're using, what software you're using, we don't care. We want to prevent your certs from expiring. And the way that we have to do that is by nagging you. And the best way to nag you is notifications. So currently we support two things. We support Slack, which a bunch of us use in the community. It will continuously nag you, however often you want to set it. We tried it out at every 30 minutes at one point. It's maybe too nagging. But at least once a day, once a week, let me know when these things are expiring. The other thing is Shout. It's a notification gateway built by Mr. James Hunt, and that can talk to many other things as well. It keeps track of the state of your notifications, so that if something's been passing repeatedly every 30 minutes, it's only going to tell you when the state switches, as opposed to every 30 minutes. And then again, when something's close to expiring, it'll tell you then as opposed to all the time. So running a server is easy. This is what it looks like when you actually spin it up. It'll initialize. It'll tell you when it's scraping things. It'll tell you how many things it scraped, how many actual certs it found, so on and so forth. At this point, it can be run locally. It can be run as a Cloud Foundry app. Push it up there to that foundry that you want to monitor. It can be run as a BOSH release. I will be making that public, probably at the end of this talk, to the CF community. So it'll be there if you want it. And that's pretty much the gist of Doomsday. Keep things ahead of time. You need to do your preventative maintenance. Otherwise, you're going to pay for it.
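The "only tell me when the state switches" behavior described for notifications can be sketched in a few lines of Python. This is a simplified illustration of the idea, not Shout's or Doomsday's real logic. The function name `transitions` and the state labels (`"ok"`, `"expiring"`, `"unknown"`) are assumptions made for the example.

```python
def transitions(previous, current):
    """Return (name, old_state, new_state) for each cert whose state changed.

    `previous` and `current` map a cert name to a state label from two
    consecutive scrapes. Certs seen for the first time are reported as
    changing from the placeholder state "unknown".
    """
    changes = []
    for name, state in sorted(current.items()):
        old = previous.get(name, "unknown")
        if old != state:
            changes.append((name, old, state))
    return changes
```

A scraper would call this after every run and send one message per returned tuple, so a cert that stays in the same state for weeks produces no repeated noise.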
You either pay for it ahead of time in small little increments or all at once. And at this point, I guess we can open it right up for questions, yeah? So the question is, why don't we have a way to automate rotating all of the certs? Well, it turns out that's a very complex problem. There are a lot. You can do it. You won't like the results, because that involves constantly updating and recreating the VMs that have certificates on them, or restarting the processes that use the certificates, which can cause things to go down at times you don't want them to. Many places have change windows within which they need to keep all of the downtime. And if you just have some sort of process that renews something and restarts processes in the middle of the day, or even just whenever you tell it to, it may end up having results that you don't like, especially if that recreate fails. Right, yep. Okay, so the question is, when does the timer start on those certs? And when is it ever updated? I think that's the corollary. So the question of when does the timer start? It starts when you initially deploy your foundry. When you get things initially deployed, it will generate a whole bunch of creds for you and it will stick them in your credential store of choice. Those get updated when you tell it to. It's not an automated process at the moment. While I've seen endeavors to do that automatically, there are certain hoops that you have to jump through in automation, and it can be done. It's just a very high level of complexity to do that automatically. And many places don't have the risk appetite for it. Yeah. So this is something I've actually been thinking about in the past month or so. So the question was, have I considered making a Prometheus exporter for the stuff that Doomsday does or for Doomsday itself? And the answer is yes, I have.
The original problem was that it was hard to get Grafana to give you the visualization in the way that you wanted. You were making a lot of sacrifices just to get it into Prometheus. But at the same time, it would be a boon to get into Prometheus, because a lot of people actually just use Prometheus as a pane of glass. They look there for all of their alerts, and especially if you have Alertmanager built in, you get the notifications for free. At this point, it's definitely on my list. I wanna get Doomsday to export the stuff in a reasonable format to Prometheus so that you get that for free. That was a very long way of saying yes, I can say that. Well, I'm happy to know that I have somewhere to copy the code from now. The leaf certs are actually not too bad, but it's hard to do any of the other trust changes with any degree of no downtime, and they all get generated at the same time. So, yeah, have you found any functional ways around that, sort of deliberately generating CAs or limiting the number of CAs? What are you and what kind of stuff? So, yes and no. For Cloud Foundry specifically, there are a bunch of CAs that get generated. If you're bringing your own certs, you can absolutely have, say, an internal CA that's valid for 10 years and that's part of your corporate trust system, and then you can sign everything under that using an intermediate CA. Yes, you can do it. The other thing is, what if you can't? So, what if your security appetite or your security team says you're only allowed one-year certs, and we have to use the best practices, and that just happens to be what the defaults are, and the default is a year? Well, you can do certificate rotations in Cloud Foundry with no downtime. I've done it, and it works as long as you get to it ahead of time.
So, the way that you end up doing it at a high level is you generate your new certs on the side, and then you have to concatenate the CAs. At this point, I believe all components of Cloud Foundry will be able to handle that and trust both of them, but it involves multiple deploys of Cloud Foundry. You will be doing two, maybe three deploys of Cloud Foundry, and that's just Cloud Foundry. You also have to do the same process with BOSH, and that's everything that BOSH handles as well. That's the current way to do it. It does work, it's a bit tedious, and you may want to scale up ahead of time. Yeah? Is Doomsday open source? Doomsday is very much open source. The link is right here, well, here? There you go. That's the one? You got a lot. PRs are most definitely welcome. If you have a burning issue that you need solved and you have the time to do it, please submit a PR, it'd be great. I know, Tom, you're doing refactors and stuff at the moment. Oh, it's already pushed, I'm that good. All done, so. It's all done. Doomsday's done, everybody. Yeah. Yep, all new features are gonna be a work in progress as we have time to do it and as people need it. Everything's demand-driven and anger-driven in our case. Yeah. Why doesn't this have something? There's a lot of stuff that isn't supported in Doomsday, especially in notifications. Like, what if you're using Microsoft Teams? Why doesn't it support that? I don't know, I haven't had time, that's all. That's the entire thing. Yep, there's a lot on our backlog. We're going to get to it, mostly, eventually, but suggestions are most definitely welcome so we can figure out what we should do next. Issues are also welcome. Yes. They tell me what I need to do in a good order. Yep, and maybe a use case that we haven't tested yet. All right. Hey, that's all the questions, I think. I think we did all right. Thank you very much. Thank you.