Welcome to Face Your X509 Fears: Automating Certificate Rotation for Cloud Foundry. My name is Iryna Shustava. I'm an engineer at Pivotal. I've been working for about a year on the CF Release Integration team, and before that, I was anchoring the CredHub team. Welcome to the talk.

So today's agenda will be a little intro to public key infrastructure, also known as PKI. We'll talk about why certificate rotation matters to you. Then we'll go over the high-level algorithm for the rotation. Then we'll look at PKI in the wild with cf-deployment, a real-world PKI, followed by a short demo. I'll show you a real-world pipeline that actually performs certificate rotation. And we'll conclude with the future of certificate rotation.

So let's talk about public key infrastructure. In cryptography, PKI is an arrangement that binds public keys to their specific entities. These entities could be either people or organizations. And this binding is established through a process of issuing certificates by a certificate authority.

So to recap some of the terminology that I just presented to you, a digital certificate is a document that proves ownership of a public key, and it includes important information about the owner of this public key. A certificate authority is just an entity that can issue certificates. Another term that's used a lot in the TLS context is X509. X509 is just a standard defining a format for public key certificates. This format is used to encode various information: more specifically, the owner of the certificate, the expiry, the certificate authority, and so on.

So in this example, we'll look at a certificate chain, the certificate chain used by Wikipedia. On the slide, you'll see a root CA. As you can see, highlighted in green, there are the issuer and the subject, which are the same. And highlighted in orange is the fact that it is a CA.
That means that this is a certificate authority that has the authority to issue certificates. Next up is the intermediate CA. That's the second link in this chain. You can see that the issuer of this CA is just the root CA from the previous slide, and the subject is just the specific information about that entity. And again, you can see highlighted in orange that it is still a CA, so this certificate has the authority to issue other certificates. And last in this chain is an example of a leaf certificate. You can see again that the issuer is the intermediate CA from the previous slide, and the subject is just the Wikipedia.org site. It no longer has the authority to issue any certificates because the CA property is set to false.

So what is the role of certificates in modern distributed systems? Really, I could have named this slide the role of TLS in modern distributed systems, but since certificates are so commonly used for TLS specifically, it's pretty much interchangeable. So imagine you have a distributed system where some components are talking to other components in various ways. The way I've presented it to you, there are currently two problems with the system, and certificates will help us solve those problems.

The first problem is that there is no privacy. All the data is transmitted in the clear, and anybody on the network can eavesdrop and listen to the communication. To address this first problem, we can just say, let's encrypt all the traffic. The naive approach to this would be to use the public key from the certificate and use asymmetric encryption for that. Then each server could decrypt the traffic with the private key that corresponds to the specific public key. Now, in practice, this approach ends up being computationally expensive, so it's not used. Instead, in the case of TLS, asymmetric encryption is used only for the handshake part.
And the rest of the traffic is encrypted with symmetric keys, which are established during the handshake. So again, here we see how certificates are used in establishing the handshake for encryption. This gives us privacy, which is great. This is the first step to making the system secure.

However, we still don't know who those servers are. There's no authentication performed as of yet. This presents a problem because you are vulnerable to a man-in-the-middle attack, which means that somebody with malicious intent could pretend to be a server on the network. They would still be able to see and decrypt the traffic because you're not verifying who you're talking to. We can solve this problem with certificates as well, if we perform certificate verification. Let's say we're looking at the case of one-way TLS: a client will verify that the certificate presented by the server is good, and that's done in two ways. First of all, you do cryptographic verification, where you verify that the certificate is signed by an authority that you, in fact, trust. In addition to that, the client can perform other checks: certificate expiry, host name verification, and so on. So this gives us authenticity. And privacy and authenticity together give us TLS. The benefits and the power of TLS really come from those two properties.

So let's talk about why you should care about rotating your certificates. The obvious thing that comes to mind is that they tend to expire. I believe cf-deployment certificates have a default expiry of one year. You can change that, but a different expiry is not the default setting. Credentials may also be leaked or stolen. In this case, you want to be able to react quickly, and hopefully you'll have a process that is streamlined so that you can mitigate the risk of leaked or stolen credentials.
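The expiry reason above can be turned into a simple check. This is a minimal sketch, not something from the talk's pipeline: the `needs_rotation` helper and the 30-day safety window are assumptions made for illustration, and the one-year lifetime mirrors the default mentioned above.

```python
from datetime import datetime, timedelta, timezone

def needs_rotation(not_after, now, window=timedelta(days=30)):
    """Return True once `now` falls inside the safety window before expiry.

    `not_after` is the certificate's expiry timestamp. The 30-day default
    window is an arbitrary illustrative choice, not a cf-deployment setting.
    """
    return now >= not_after - window

# A certificate issued on 2018-01-01 with a one-year lifetime.
issued = datetime(2018, 1, 1, tzinfo=timezone.utc)
not_after = issued + timedelta(days=365)

early = needs_rotation(not_after, datetime(2018, 10, 1, tzinfo=timezone.utc))
late = needs_rotation(not_after, datetime(2018, 12, 15, tzinfo=timezone.utc))
print(early, late)
```

A check like this only tells you when to start; the rotation itself still has to follow the steps described in the talk.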
And furthermore, you can take that second point even further and just rotate often and go fast to mitigate that risk. Rotate often, even when you don't have credentials that are leaked or stolen and you don't have certificates that are expiring.

Let's look at PKI in the real world. So we have cf-deployment, which is the canonical BOSH manifest that we're using for deploying Cloud Foundry in open source. And here's one example of a certificate authority that cf-deployment uses. It's called service_cf_internal_ca. In this case, I'm including just two examples of certificates that it signs, cc_tls and blobstore_tls. There are, in fact, a lot more. Moreover, there are more of these certificate authorities in cf-deployment. Pretty much every subsystem in this large distributed system is using its own PKI. There are about nine of them, and they each sign a varying number of certificates. So again, you want to be able to rotate these certificates in a way that will not cause any downtime for you, hopefully.

So what is the high-level rotation algorithm? Here I should disclose a little caveat: I've been pretty much using CAs and certificates interchangeably, but there is a difference in the algorithm, in that certificates can be rotated in a single deploy, whereas CAs require a three-step rotation. I'm not really going to cover leaf certificate rotation in this talk, because it's a simpler case of CA rotation. Instead, I'm going to focus on full rotation, so CA rotation.

Step one is to add a new CA. In this step, you configure everything to trust both CAs and redeploy, so that component trust stores are updated with the new CAs. In step two, you regenerate all of your certificates with the new CA and then deploy again. And in step three, you remove the old CA. That's where you configure everything to only trust the new CA.
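The three steps just described can be sketched as a sequence of configurations. This is a toy model under stated assumptions, not the speaker's tooling: the `Config` type, the CA names, and the helper functions are all hypothetical. It assumes that during each deploy, instances running the configuration from before and after that deploy coexist for a while, and it checks that they can still verify each other's certificates.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    """What one instance holds at a point in the rotation (toy model)."""
    trusted_cas: frozenset  # CAs in the instance's trust store
    cert_issuer: str        # CA that signed the instance's own certificate

def can_verify(client, server):
    """The client accepts the server if it trusts the server cert's issuer."""
    return server.cert_issuer in client.trusted_cas

def compatible(a, b):
    """Two instances can talk if each one can verify the other."""
    return can_verify(a, b) and can_verify(b, a)

def rollout_is_safe(states):
    """Every deploy moves between two states whose instances stay compatible."""
    return all(compatible(before, after)
               for before, after in zip(states, states[1:]))

OLD, NEW = "old-ca", "new-ca"  # hypothetical CA names
start = Config(frozenset({OLD}), OLD)
step_1 = Config(frozenset({OLD, NEW}), OLD)  # add new CA, trust both
step_2 = Config(frozenset({OLD, NEW}), NEW)  # regenerate certs with new CA
step_3 = Config(frozenset({NEW}), NEW)       # remove old CA

print(rollout_is_safe([start, step_1, step_2, step_3]))  # True
print(rollout_is_safe([start, step_3]))                  # False
```

In this model the three-step sequence keeps old and new instances compatible across every deploy, while jumping straight from the starting state to the final state does not, because an instance that still trusts only the old CA cannot verify a certificate signed by the new one.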
So step three is kind of the reverse of step one, followed by a redeploy.

Why do we need three steps for rotation? Why can you not do it all in one step? Well, the reason is that in a typical BOSH deploy, components get updated in a certain order, and the instances within instance groups have rules about how they get updated. For instance, there are canary instances and so on. Consider, for example, the Diego API, which is also known as BBS, which is trying to talk to the service that it uses for locks, which is called Locket. Both of these services are deployed together on a single VM, and that VM has two instances. So you can imagine that if you were trying to update CAs and certificates at the same time, you would end up in a situation where one of the VMs has all the new stuff and the other has all the old stuff, and your high availability is broken. In fact, in this case, I believe Diego prevents you from getting into that kind of situation by just failing to start up, so you will have a failed deployment.

All right, so implementing the algorithm. That's where we're going to go and look at the real-world pipeline. Let me update my displays. Okay, so this is the simple pipeline that pretty much incorporates the algorithm. This pipeline takes about an hour and a half to run end to end for a vanilla cf-deployment, so I'm not going to have you sit through that, and I'll just show you the results. As you can see, each Concourse job represents a specific step that we just looked at: step one for adding new CAs, step two for regenerating certs, and step three for removing the old CAs.

Now let's dive a little bit more into what each of these jobs does. The first part of step one is to configure everything to trust both CAs. Unfortunately, there currently aren't really APIs for doing that out of the box, so I've built a little glue code that I call the Rotator.
It's a command line tool that essentially makes the appropriate calls to the CredHub API and ensures that all the configuration is correct for the BOSH Director. So in this case you just say rotator add new CAs, and it will find specifically the root CAs to rotate. As we saw, there are nine of them, so it will tell you that it regenerated nine CAs.

Then you BOSH deploy, and for the BOSH deploy, I've used some of the Release Integration tools. We provide a lot of tools to help you test Cloud Foundry, and one of those tools is called Uptimer. Uptimer helps you measure uptime during deploys. So here I really wanted to show what kind of downtime we're going to be seeing throughout the rotation. I'm going to scroll here for a bit because there's a lot of output, so try not to pay attention.

Okay, so the important part comes here, where the deployment is finished and Uptimer prints out this nice summary of the various measurements that it took throughout the deploy. HTTP availability is just testing your application routability: can I still send requests to my application? In this case we're seeing that there were zero failed attempts out of about 2,000 attempts. For app pushability, which is just the ability to run cf push, there were about two failed attempts out of about 34. Furthermore, it checks whether you can retrieve recent logs or stream logs, and there were about six and one failed attempts respectively. This is all within the threshold. So you can't really reach zero downtime in terms of the CF API and log API components, but it's still within the threshold, and it's still pretty minimal if you look at the number of attempts.

Okay, let's also look at the second step. Similarly, here we need to regenerate certs, so we're going to ask the Rotator to do that: regenerate all the certs for us. And for the BOSH deploy I will also just show you the Uptimer summary, where we're seeing that there really are zero failed attempts to perform
GET requests, and in this case it even had zero downtime when it comes to cf push, which is pretty amazing. Step three is very similar. I'm going to skip the Rotator step because I think it's pretty obvious, but let's again look at the Uptimer summary. Again we're seeing that there is very minimal downtime in terms of the CF API and practically zero downtime when it comes to HTTP availability. I should also say that zero downtime is a bit of an overloaded term. It is obviously subject to the precision of your tools, so in this case we can only estimate it in terms of the number of attempts that are made and the number of successes.

So let's talk about the role of each component in that implementation. To start with, we have CredHub, which is our secure storage for CAs and certificates. It also has some convenient APIs for regenerating these certificates, and more specifically for regenerating all certificates signed by a certain CA, so you can do them in bulk and don't have to make individual calls. Then of course we have BOSH. BOSH is a great tool that helps us have repeatable deploys. In this case it handles deploys and integrates with CredHub to retrieve all the latest credentials, so we're really utilizing that CredHub and BOSH integration, which is very important. The Rotator is the glue code that I was telling you about, and it really does two things. It makes calls to the CredHub API to regenerate these credentials, and it creates the concatenated and unconcatenated CAs in CredHub. CA concatenation is really a standard practice when it comes to trusting multiple CAs at the same time. And of course there's Concourse, which helps us build pipelines and automate all the things.

There are definitely benefits to the current implementation. For starters, it's completely automated, which is really awesome. Because it's automated, it's a lot less error prone.
You no longer have to deal with this manual process of CA rotation. We saw that there was zero application downtime, so you don't have to worry about incurring downtime when you're rotating your certificates. And there was pretty minimal CF API downtime.

So what's the future of certificate rotation? I hope that most of the glue code that currently exists will eventually live in BOSH, specifically when it comes to configuring components to trust multiple CAs, so the concatenation of CAs. Further in the future is automating certificate rotation for the BOSH Director itself. What we're seeing is that the BOSH Director is improving its security posture as well, adding more TLS to various interactions between the agent and the director. At some point we will need to solve that problem too, in a way that can be automated and does not cause downtime for your deployments.

Okay, so in conclusion, we've learned that certs are an important part of a distributed system, or I should say of a secure distributed system, and that rotating certificates is also very important. We've learned the algorithm for CA rotation, which can be done in three simple steps. We also saw that using automation we can make CA rotation frequent and reliable. And all together, the combination of BOSH, Concourse, and CredHub can really make certificate rotation easy and more streamlined. I'm including some resources here: the code, and cf-deployment and cf-deployment-concourse-tasks in case you're interested. But that is it. Any questions?

I want to point out, on a technical note, that you can throw away the whole CF internal CA and all its certificates and just deploy, and it works. So no jobs in Cloud Foundry will refuse to start up.
Even if the certificates do not match one another, jobs will only fail to start if the CA certificate they have locally doesn't match the client certs they have available locally. We had that case in January, when we had a minor issue during rotation and found out that the CA certificates the BOSH CLI creates are not compatible with the client certs CredHub generates for them.

Interesting. Okay, that's interesting. I haven't tried that particularly, but I have in fact tried rotating all CAs and certificates in one step, and you will end up with a failed deployment. I haven't tried doing them piecewise, so perhaps that works.

Have you considered rotating client certificates as well? By that I mean we plan to allow customers to have their own certificates, which we get from services like Let's Encrypt, and we plan to store them in CredHub. The problem we currently face is that we don't get notified when one of them expires or is about to expire, which happens very frequently with those, every 90 days or so. Currently there's no solution for us except checking once per day and asking CredHub.

Yeah, so the algorithm that I've described will work for both mutual TLS and one-way TLS. Most of the Cloud Foundry CAs are stored in CredHub and are only used internally, but it should work for mutual TLS as well if you use that same algorithm.

Okay, thanks.

Does it work for Pivotal Cloud Foundry? Your pipeline?

I do not know. I've only tried it with cf-deployment.

Okay, then thank you. If there are no other questions, have a nice trip home this evening, and see you next year in Pennsylvania.