Cool. Okay, thanks for coming along, everybody. I'm Doug Chivers, I'm a security architect at HP Cloud, been here for a bunch of years now, and together with Tim Kelsey and Tom Cammann I'm going to talk to you about secure ephemeral PKI with the Anchor project. So first up, I'm going to talk to you about ephemeral PKI and explain what one is. Tim is going to talk to you about our implementation of the ephemeral PKI, and Tom is going to talk about how we use it in HPE Helion OpenStack.

First up, what is an ephemeral PKI, and why do I want one? Today everyone seems to care about cloud security. It's the number one barrier to cloud adoption, it's what most of our customers worry about, and effectively we need to make cloud secure. So I'm going to start by covering very briefly why you need a PKI at all, and then drop into what an ephemeral PKI is and why one of those is better.

A typical cloud deployment has got loads of services. You've got Nova, Swift, Glance, et cetera, et cetera. Traditionally you'd separate them into security domains, do some sort of filtering, some sort of monitoring between the security domains, maybe using physical separation if you're particularly cautious, maybe using VLANs, and this works great on paper. But then you come to add in all the other services you need to make a system work and you end up with a huge number of things needing to talk to lots of other things. This doesn't work out so well, so the next step is your security guide tells you to turn on TLS, because everyone knows you just turn on TLS. And this makes everything confidential and secure; more crypto has got to be better, right? Unfortunately, all the supporting services also need TLS, although there are some exceptions. Turn all of that on and it's all happy, except TLS needs some supporting infrastructure of its own.

Most TLS deployments use X.509 certificates to identify the TLS endpoints. Typically you use certificates on the server, but in some cases you use them on the client as well. The certificates are obtained by submitting a CSR to a registration authority, which handles the issuing decisions, and then the certificate authority handles the actual signing of the certificate. During the process, a server owner would submit a CSR to the RA. The certificate administrator will look at the fields in the request and go: do these meet our policy? Are they sensible? Is this server owner trying to do the right thing? And if it matches, then issue the certificate. The certificate authority also provides a bunch of revocation functionality, so that if the administrator decides he's made a mistake and shouldn't have issued that certificate, or if the certificate becomes compromised, he can then revoke it.

However, there are a few issues associated with this. Firstly, revocation is largely theoretical. There are a couple of mechanisms, CRLs and OCSP. Certificate revocation lists are just what they sound like: a list of certificates that have been revoked. Unfortunately, they become largely unwieldy in real life. I don't think any operating systems have shipped with CRLs for quite a while, and they need to be maintained and kept up to date. Secondly, you've got OCSP, the Online Certificate Status Protocol, which is implemented in a bunch of web browsers but in almost no client libraries whatsoever. So neither of these things works particularly well in service-to-service architectures.
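As a concrete illustration of the CSR-to-RA flow described above (this is not from the talk), here is a minimal sketch of generating a private key and a certificate signing request with the PyCA cryptography library, which is also the library Anchor itself uses, as covered later. The hostname, organisation, and file names are made up for the example.

```python
# Minimal sketch: generate a private key and a CSR to submit to a
# registration authority. Hostname and filenames are illustrative only.
from cryptography import x509
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.x509.oid import NameOID

key = rsa.generate_private_key(public_exponent=65537, key_size=2048,
                               backend=default_backend())

csr = (x509.CertificateSigningRequestBuilder()
       .subject_name(x509.Name([
           x509.NameAttribute(NameOID.COMMON_NAME, "nova-01.example.com"),
           x509.NameAttribute(NameOID.ORGANIZATION_NAME, "Example Cloud"),
       ]))
       .add_extension(
           x509.SubjectAlternativeName([x509.DNSName("nova-01.example.com")]),
           critical=False)
       .sign(key, hashes.SHA256(), default_backend()))

# Write out the key and the CSR; the CSR is what gets submitted to the RA.
with open("server.key", "wb") as f:
    f.write(key.private_bytes(serialization.Encoding.PEM,
                              serialization.PrivateFormat.TraditionalOpenSSL,
                              serialization.NoEncryption()))
with open("server.csr", "wb") as f:
    f.write(csr.public_bytes(serialization.Encoding.PEM))
```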
Secondly, bulk certificate refresh is pretty difficult. If you find yourself in a Heartbleed-type situation where you need to refresh all the private keys in your organisation, ideally by yesterday, it's very hard to figure out which ones are still in use. You'd better hope your PKI and your CMDB are very tightly tied together, otherwise you'll spend roughly a week figuring out which certificates are still in use and which ones aren't. Another problem is that manual certificate administration is not 100%. People make mistakes. I have issued certificates incorrectly, and I'm the person who wrote the certificate policies, so I should know what should be issued. If your certificate administrator says they've never issued a certificate incorrectly, they're probably either mistaken or lying. The worst part is they may well not have noticed they did it. And finally, certificates expire. Everyone knows that certificates are going to expire, but sometimes it still catches you out, and it quite often results in a 3 a.m. phone call asking why something's gone offline.

So, an ephemeral PKI. We've taken an alternative approach to PKI and are focusing on using passive revocation rather than active revocation. We do this with short-lived certificates, which are typically valid for 12 to 24 hours. Passive revocation works with certificate expiry: once a certificate has expired, it's no longer regarded as valid. Clients more or less without exception support this and apply strong checking to expiry dates, even if they don't do any revocation checking at all. Clearly replacing the private key is a client decision, but when the certificates are being bounced every 12 to 24 hours, it's not a massive challenge to replace the private keys as well, and at that point you've got a poor man's version of perfect forward secrecy.

The nice thing about passive revocation is there's no need for CRLs or OCSP. You simply wait for the certificate to expire. In the event a certificate needs to be revoked, you just wait 12 hours and don't issue it another one. Now, clearly maintaining this is not something that a certificate administrator could do. We have probably 1,000, 2,000, 3,000 certificates running on our public cloud at any moment. A whole bunch of those are still in use, some of them aren't, and I can't tell you which ones. No certificate administrator could replace 3,000 certificates every day. So at this point we built a rules-based certificate issuing and management process called Anchor. The ephemeral PKI uses a rules-based decision engine which applies a series of validators to enforce the certificate policy. The certificate policy is roughly the same as a certificate administrator would use when making a manual decision, except it's enforced automatically, so it's enforced 100%, every time. And finally, it's a stateless system, so it's easy to deploy in high availability, and it has some nice benefits like being able to deploy it in silos. So if I don't want my Nova nodes to trust my Swift PKI, I simply don't install the trust anchor for that, and a compromised Swift certificate couldn't be used to impersonate the Nova nodes, say.
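The silo idea comes down to which CA certificates each client is told to trust. This isn't code from the talk, just a small sketch of the client side: a client configured with only the Nova silo's CA bundle will reject a certificate issued by the Swift silo's CA, even if that certificate is otherwise valid. The file path and port are hypothetical.

```python
# Sketch: per-silo trust anchors. A client configured with only the Nova
# CA bundle will refuse TLS connections presenting Swift-silo certificates.
import socket
import ssl

NOVA_CA_BUNDLE = "/etc/ssl/silos/nova-ca.pem"   # hypothetical path

def connect_to_nova(host, port=443):
    # Trust *only* the Nova silo CA, not the system-wide bundle.
    context = ssl.create_default_context(cafile=NOVA_CA_BUNDLE)
    context.check_hostname = True
    context.verify_mode = ssl.CERT_REQUIRED

    sock = socket.create_connection((host, port))
    # The handshake fails with ssl.SSLError if the server presents a cert
    # signed by a CA outside this silo (for example the Swift CA).
    return context.wrap_socket(sock, server_hostname=host)
```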
I'm going to hand over to Tim, who's going to talk to you about the Anchor project, which is our implementation of this.

Thank you, Doug. Okay, so my name's Tim Kelsey. I'm a security engineer working mostly upstream for HP, and I'm going to talk a little bit about Anchor. Right, so Doug mentioned the basic idea of what an ephemeral PKI actually is and what advantages it has. One of the disadvantages is that there are a lot more requests: a lot more certificates need to be refreshed a lot more quickly. So Anchor is our ephemeral CA implementation and an automatic registration authority. It issues very short lifetime certificates as part of the public key infrastructure. The key points are that it has no active revocation mechanism at all, it relies entirely on passive revocation, and it is itself ephemeral. It's stateless, which means it's very easy to deploy in HA configurations and what have you, and it has no additional deployment overhead. You don't need to add OCSP responders or maintain databases of who's got what certificates. You just issue them and basically forget, because they're going to expire, and you rely on your policy not to reissue that certificate.

Anchor itself is a fairly recent project that evolved from an internal component we developed at HP. It has since been open sourced, released under the Apache 2 licence, and is available on Stackforge. The project falls under the auspices of the newly formed Security project; this was formerly the OpenStack Security Group and the Vulnerability Management Team, who have now merged. We want Anchor to be a good OpenStack citizen, so we have a very strong focus on our CI/CD gate tests. We're aiming for 100% test coverage; we've probably got about 98% currently, so we're well on the way to getting there. We've also integrated a tool called Bandit into our gate tests. Bandit is another Security project undertaking: a tool for scanning Python code automatically and flagging up potential security problems. It's kind of like a linter but with a specific security focus. We actually have a talk later this week about Bandit specifically, so if anyone's interested in that, come along. And finally, we've tried to make strong idiomatic choices for our libraries, dependencies and bits and pieces, so there shouldn't be any surprises to anybody who's familiar with working on any OpenStack component when looking at the code base.

Functionally, Anchor breaks down into four main blocks. We have a REST API, as you'd expect, an authentication system, the decision engine, and finally the certificate issuing system. Our REST API is built on Pecan, as you would expect. It's very simple, very minimal; we actually only have a single endpoint, /sign. This is an example of a very simple request, shown on screen there. We have four basic fields. You provide a user to indicate who's requesting the certificate. We have a secret field, which is used by the authentication module; I'll talk about that in a moment. We have the CSR, which is an X.509 certificate signing request. And finally, there's a field to indicate how that CSR has been encoded. We only support PEM at the moment, but that gives us flexibility for the future.
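Based on the request Tim describes (a user, a secret, the CSR, and an encoding field, sent to a single /sign endpoint), a client call might look roughly like the sketch below. This is an assumption rather than Anchor's documented API: I'm assuming the values are sent as form fields and that the endpoint returns the signed certificate as PEM text; the URL, port, and credentials are placeholders.

```python
# Sketch of submitting a CSR to Anchor's /sign endpoint. Field names follow
# the talk (user, secret, csr, encoding); URL and credentials are made up.
import requests

ANCHOR_URL = "https://anchor.example.com:5016/sign"   # hypothetical endpoint

def request_certificate(csr_pem, user, secret):
    resp = requests.post(
        ANCHOR_URL,
        data={
            "user": user,            # who is asking for the certificate
            "secret": secret,        # checked by the authentication module
            "encoding": "pem",       # only PEM is supported today
            "csr": csr_pem,          # the X.509 certificate signing request
        },
        verify="/etc/ssl/silos/nova-ca.pem",  # trust anchor for this silo
        timeout=30,
    )
    resp.raise_for_status()
    return resp.text                 # assumed: the signed certificate as PEM

# Example usage:
# cert_pem = request_certificate(open("server.csr").read(), "nova-01", "s3cret")
```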
So once the REST API has ascertained that all the expected fields are present, the next thing is the authentication module. Anchor actually ships today with three different authentication modules that can be configured through a JSON configuration file. We have a very basic shared-secret module, which just uses a preconfigured shared secret stored in the configuration file itself. It's very basic and not particularly secure, but it is self-contained, so if anyone wants to try out Anchor or run it in a test environment or what have you, it's the easy choice. In addition to that, we have a Keystone-based authentication module which uses a Keystone token, and this has the advantage that it can pass role information through into the decision engine; I'll speak about that in the next slide. And we have an LDAP implementation which passes through group membership as opposed to Keystone roles.

So assuming the authentication module says, yep, everything is great, we move on to the next step, which is the decision engine. The decision engine is our automatic registration authority, effectively. As Doug mentioned, it's built out of a series of rule chains which are built from composable validators, where a validator is simply a Python function. We didn't use any domain-specific language or anything complicated here, just Python. We all know it and it does the job quite nicely. Effectively, the CSR is presented, along with any extracted authentication details, to each of the validators in each of the validation chains. The validator will run, and either it will exit cleanly, which indicates that the check passed, or it will raise an exception to say that something's wrong. In the event of an exception, we bail out and report an error to the user. As each step in the validator chain executes, we emit appropriate log events to say whether it failed or passed and why, for the purposes of forensics and auditing.

Anchor ships with a whole bunch of these validators, and there's a lot of interesting stuff in there. Conceptually, they fall into three sets; we don't really make a hard distinction in the code, they're all just functions, but you can think about them this way. At the lowest level, we have CSR sanity checking. These just make sure that all the right punctuation is in the right places in the CSR; it's just syntax, really: the right number of fields, no duplications, and bits and pieces. Above that, we have validators which enforce the security policy. These check that the FQDNs are correct, that they resolve to appropriate IP addresses, and that everything is as you would expect it to be based on whatever security policy you have. And then finally, and perhaps most interestingly, we have a number of validators which encode expert knowledge. This is specific information about how Anchor is being used and how your actual cloud deployment is configured. So, as Doug mentioned, you could silo off an Anchor instance to provide certificates for your Nova nodes. In that scenario, you might have a naming convention, so you know that all of your Nova nodes have an FQDN that matches a certain pattern, and you can encode that into one of these expert knowledge validators. You may know that all of your Nova nodes live in a certain predefined IP block, and you can verify that the provided FQDN resolves to an IP within that range. And you can do various sensible things about making sure that the appropriate roles are present and what have you for provisioning Nova resources. So you can write very specific rulesets for individual bits within the cloud deployment and use a siloed version of Anchor to enforce those.
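To make the pattern concrete, here is a rough sketch of the decision-engine idea Tim describes: validators are plain Python functions that either return cleanly or raise, and a chain simply runs them in order, logging each result and bailing out on the first failure. This is not Anchor's actual code; the exception class and function signatures are invented for the illustration.

```python
# Sketch of the validator-chain pattern described above (not Anchor's code).
import logging

LOG = logging.getLogger(__name__)

class ValidationError(Exception):
    """Raised by a validator when the request breaks policy."""

def common_name_ends_with(csr, auth, domain=".example.com"):
    # A trivial policy validator: the CN must sit under our domain.
    if not csr["common_name"].endswith(domain):
        raise ValidationError("CN %s is outside %s" % (csr["common_name"], domain))

def requester_has_role(csr, auth, role="nova-node"):
    # Uses role information passed through from Keystone or LDAP.
    if role not in auth.get("roles", []):
        raise ValidationError("requester lacks role %s" % role)

def run_chain(csr, auth, validators):
    for validator in validators:
        try:
            validator(csr, auth)
            LOG.info("validator %s passed", validator.__name__)
        except ValidationError as exc:
            LOG.warning("validator %s failed: %s", validator.__name__, exc)
            return False            # bail out and report an error to the user
    return True                     # every check passed; go and sign

# Example usage:
# ok = run_chain({"common_name": "nova-01.example.com"},
#                {"roles": ["nova-node"]},
#                [common_name_ends_with, requester_has_role])
```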
So assuming the decision engine says everything is great, that this is a valid, reasonable and correct CSR, the next step is actually issuing the certificate. This is probably the most straightforward bit of the whole thing; it's very basic. We don't write our own crypto code here, for obvious reasons. We're using PyCA cryptography, which is an excellent, very high quality library that wraps OpenSSL and various other back ends through an FFI interface. We actually needed to add a little bit of X.509 code of our own just to wrap the FFI bindings exposed by cryptography, because some functionality was missing; hopefully that's changed since I wrote this. And we emit appropriate audit logs to say, yes, we issued this certificate for this person on this date, and so on and so forth.
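For the issuing step itself, a minimal sketch with the current PyCA cryptography x509 API (which has since filled in many of the gaps Tim mentions) might look like the following: take the validated CSR, sign it with the CA key, and keep the validity window short. The 12-hour lifetime mirrors the talk; the function signature and everything else is illustrative, not Anchor's implementation.

```python
# Sketch: issue a short-lived certificate from a validated CSR using the
# PyCA cryptography x509 API. Loading the CA key/cert is assumed elsewhere.
import datetime
import uuid

from cryptography import x509
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import hashes

def issue_certificate(csr, ca_cert, ca_key, lifetime_hours=12):
    now = datetime.datetime.utcnow()
    builder = (x509.CertificateBuilder()
               .subject_name(csr.subject)
               .issuer_name(ca_cert.subject)
               .public_key(csr.public_key())
               .serial_number(int(uuid.uuid4()))
               .not_valid_before(now)
               .not_valid_after(now + datetime.timedelta(hours=lifetime_hours)))
    # Carry across extensions the validators approved (e.g. subjectAltName).
    for ext in csr.extensions:
        builder = builder.add_extension(ext.value, critical=ext.critical)
    return builder.sign(ca_key, hashes.SHA256(), default_backend())
```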
So, okay, that's basically a whistle-stop tour of Anchor. It's a very early project, as I said, so we've got a fairly extensive roadmap of bits and pieces that we're going to be working on going forward. Here are a few of the highlights. Documentation isn't exactly where we'd like it to be today, so pretty much straight after this, actually, we're going to be putting a bunch of effort into bringing that up to scratch. We obviously want to replace this bit of X.509 wrapping code that we have and call directly into PyCA cryptography. A Barbican plug-in would be a really great thing to have; Barbican, as of Kilo I believe, has certificate operations now, so it would be great to be able to use Anchor as a backend CA for those operations. We'd like to add a KMIP-based HSM interface layer so that we can use a hardware security module for the certificate operations rather than the locally available OpenSSL library or whatever. We'd also like to pull in the PyCADF library for nicer audit logs and the various bits and pieces you get from that. More validators will emerge as we explore additional use cases. We've talked about a message queue model for getting certificates, so you can use a queue to issue requests and receive your certificates. And some hardening profiles for the whole thing would be great. Those are all things we're working on, so if that's piqued anybody's interest, we'd always welcome patches, comments, input, whatever would be great. That's all I have; I'm now going to pass the floor to Tom, who will talk about how we actually use this thing.

Thanks, Tim. Hello, I'm Tom Cammann. I'm on the compute team for HP's Helion OpenStack, and I've recently been looking at using TLS in our deployment and getting TLS everywhere using Anchor. We've been running Anchor in production now for a few months, since our 1.0 release, so I'm going to run through how we've done this and the lessons we've learned. This is all based on a TripleO deployment in 1.0. There's already work done by the community to get some TLS into TripleO, so we based our initial design around what was already there. This is the existing architecture, where you get a client connection coming in to the control plane. The control plane is just the management nodes running all the API servers, all the interesting stuff like that. The connection comes in to the virtual IP address, the VIP, which is assigned to one of the nodes in that cluster, and this is handed off to the load balancer and then off to the local service listening on that node. So we used that and added a bit into the gap. Now we've got the client using its native HTTPS coming into the load balancer again, but this time handing off into a TLS terminator; in this case we're using stunnel. The reason we use this, rather than having the services do their own native termination, is that they're not really designed for reloading the certs every 12 hours. We tried to implement it natively, but none of the libraries are really there to reload certs on the fly; the Python libraries don't support that yet. Some services do, but we went for this blanket approach using the TLS terminator.

This is a more physical diagram of how a connection flows through our cloud at the moment. HAProxy load-balances across to stunnel, and then stunnel strips off the TLS and passes plain text across to the local host. So there's no unencrypted communication across our cloud at all. To talk about a few of those components: stunnel is configured to listen on a port and then needs another port to hand the unencrypted connection off to. It's really easy to configure, works really well, and obviously you can reload certs on the fly; you just have to send it a SIGHUP. One little gotcha we found was a really bad bug in versions up to 5.9 where occasionally it would just prematurely close the connection, so we were having Glance image downloads fail completely randomly. It's pretty easy to harden as well: make sure you disable SSLv2 and SSLv3 unless you want to get POODLEd, and make sure you choose your cipher suites. Our load balancer is HAProxy. The first thing to note is that it doesn't really know about TLS at all; it's running at layer 4, so it just hands off TCP packets to the services below. The thing we did have to change was the health checks. The health checks are pretty much just there to make sure the services underneath are alive and still running. Initially they were using the default HAProxy checks, which were just socket connections to see if the service was still running, but now we've got this layer of stunnel in between, we have to do full HTTP checks, which has actually improved the availability of our cloud; we've configured that to be much nicer. We've also added some better checks for RabbitMQ and MySQL, to check for partitions and so on.

Deploying Anchor itself is really simple. It's based on Pecan, which means you can just make a virtualenv, pip install your requirements, make sure you've installed your other system requirements with apt, and then run it up under uWSGI or whatever WSGI-based stack you want to use. We did that in TripleO with an image element and then configured it with Heat. Heat just passes in all the information you need: the validation rules, how long you want your certs to be valid for, and so on. It works really nicely. Along with deploying Anchor, you need some other things. You need NTP, because we're issuing certs every 12 hours and there's a small window when you replace a cert where, if the nodes are out of sync, a node can become completely unavailable because clients don't trust it. When a client connects to a server and the server has a cert which was issued at what the client thinks is a point in the future, the client will say: I can't do this, that's an invalid certificate. So make sure the time on all your nodes is really close. We did have the option of backdating the issue date of the certificate, but we didn't want that hack, so we just made NTP really tight on our cloud.
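The NTP issue Tom describes is that a freshly issued certificate can carry a notBefore timestamp that a skewed client believes is in the future. This isn't from the talk, but a small sketch of checking for that condition on a node, for example as part of a health check, could look like the following; the tolerance value is arbitrary.

```python
# Sketch: detect the "certificate issued in the future" condition caused by
# clock skew between the CA node and this node. Tolerance is arbitrary.
import datetime

from cryptography import x509
from cryptography.hazmat.backends import default_backend

def cert_times_sane(cert_path, tolerance=datetime.timedelta(minutes=5)):
    with open(cert_path, "rb") as f:
        cert = x509.load_pem_x509_certificate(f.read(), default_backend())
    now = datetime.datetime.utcnow()
    if cert.not_valid_before > now + tolerance:
        # Our clock disagrees with the CA's: clients with a similar skew
        # will reject this certificate as "not yet valid".
        return False
    if cert.not_valid_after < now:
        return False    # already expired; passive revocation has kicked in
    return True

# Example usage: cert_times_sane("/etc/ssl/certs/local-service.pem")
```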
Another problem with deploying Anchor is that you have to have everything behind TLS, and that means Anchor as well. So if you want to get certs for your other nodes, your control plane nodes have to come up first; there's a sort of ordering here. The control plane node has to get its own cert first, over localhost, and then kick the terminator back into action so Anchor can be accessed from the rest of the cloud, and then you're up from there. Obviously, before Anchor comes up, none of the other servers can talk to each other, because no one has a cert yet, so there's some quite important ordering there.

The Anchor client is the piece that has to retrieve the cert from Anchor: it passes in all the information, creates the CSR, and also has to check the expiry. Since we want a new cert every 12 hours, we want to check when the current one is about to expire and we need a new one, then generate a CSR and talk to Anchor. We also always get a new cert on reboot, which is a check we added because we found that when a node came back up from a reboot it sometimes wasn't in sync with NTP, because NTP can take a while to catch up with the real time. So we always get a new cert when we reboot, make sure it's valid for another 12 hours, and continue. There are also some other actions we need to do when we get a new certificate, such as kicking the TLS terminator so it reloads the certificate. Initially we looked at using certmonger for this, but it was a bit too bulky and has quite complex configuration; it just wasn't simple enough for what we wanted. So we went off and used some cron and bash. We run checks every couple of hours with cron, which just calls out to the openssl command you can see up there, which parses the expiry date. If the cert is about to expire in the next hour, we get a new certificate. Generating a new certificate means talking to Anchor using that curl command, and Anchor will talk back to us, hopefully give us a certificate, and we can carry on.
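The cron-and-bash check Tom describes (parse the expiry with openssl, renew if less than an hour remains, then kick the TLS terminator) could equally be sketched in Python roughly as below. `request_certificate` stands in for the /sign call sketched earlier; the file paths, CSR handling, and renewal threshold are assumptions for the illustration.

```python
# Sketch of the renewal check: if the current cert expires within the
# threshold, fetch a new one and signal stunnel (SIGHUP) to reload it.
import datetime
import os
import signal

from cryptography import x509
from cryptography.hazmat.backends import default_backend

CERT_PATH = "/etc/stunnel/service.pem"        # hypothetical paths
STUNNEL_PIDFILE = "/var/run/stunnel.pid"
RENEW_THRESHOLD = datetime.timedelta(hours=1)

def needs_renewal(cert_path):
    with open(cert_path, "rb") as f:
        cert = x509.load_pem_x509_certificate(f.read(), default_backend())
    remaining = cert.not_valid_after - datetime.datetime.utcnow()
    return remaining < RENEW_THRESHOLD

def reload_stunnel(pidfile):
    # stunnel re-reads its configuration and certificates on SIGHUP.
    with open(pidfile) as f:
        os.kill(int(f.read().strip()), signal.SIGHUP)

def renew_if_needed():
    if needs_renewal(CERT_PATH):
        # request_certificate() is the /sign call sketched earlier; it is
        # assumed to return the new certificate as PEM text.
        new_cert = request_certificate(open("server.csr").read(),
                                       user="nova-01", secret="s3cret")
        with open(CERT_PATH, "w") as f:
            f.write(new_cert)
        reload_stunnel(STUNNEL_PIDFILE)
```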
Some of the problems we faced deploying this: one of the big ones was clustering traffic. Clustering traffic is really hard to deal with under an ephemeral PKI. Both Rabbit and MySQL can run their clustering traffic over TLS, but it's not really possible to do that with ephemeral certificates, mainly because neither supports reloading certificates while the cluster is up. MySQL Galera clustering actually requires all the nodes in the cluster to have the same private key and certificate, so it doesn't make sense to use ephemeral private keys there and refresh them; you'd have to restart the cluster as well, and it just gets really messy. It's a similar sort of thing for Rabbit. We did talk to the Rabbit team, and I think they're making progress on being able to refresh certificates as you go along, but at the moment it's not there yet. So we use long-lived certificates for these; we can't use the ephemeral PKI there yet, unfortunately. Configuring OpenStack services was a bit of a pain as well. There are so many different flags and protocol settings you have to set in all these different config files. It took our team a couple of weeks to really find all the little config values and set them all, and eventually we got TLS everywhere, which was great, but it took a while to find them all and we had to patch a few client libraries as well. I think there is some work being done upstream to unify all these options, so that's coming, but we had to find it all ourselves. Monitoring was also a bit tricky. That's because the certificates we get for the nodes are signed for the virtual IP address on the control plane, but the monitoring wants to go directly to the node, so it can't validate the certificate for the actual node itself; it would have to validate against the virtual IP address. So we have to turn off cert validation for monitoring, which isn't a big deal, but it would be nice to have.

Some improvements we're looking at doing in Helion OpenStack 2.0: we want to have more layered termination. Where we saw that HAProxy isn't aware of TLS, we want to make it TLS-aware and then re-encrypt on the way through, which gives us a lot more flexibility in how we control that. We also want multiple IPs supported in the subjectAltName, which means a cert can be valid for multiple IP addresses or names. We're also looking at doing more siloed deployments, so per-service certificates, which would harden things a lot. And we're looking at having the services listen on a Unix socket, so rather than having the service listening on the network, you keep it completely local and have the external connection hand off to the Unix socket, which would be really great.

Here are a few references for you. Cathead is a project I've been working on which basically replaces the cron-and-bash approach for cert tracking and cert retrieval. It's a bit of Python code which does somewhat smarter scheduling, talks directly to Anchor, and is extendable, so check that out if you're interested in that sort of thing. And obviously you've got Anchor on Stackforge, the Security project and everything else. Thanks, guys; if you've got any questions, I think we've got some time.

Hi, thanks for the talk. A quick question about the cron job with the curl: are there any plans to use a configuration file, or allow us to use a configuration file, instead of the username and password, so they don't end up in the process list?

For the Anchor client sort of stuff? Yeah, I think that's the project I mentioned at the end there. Cathead is a lot more configurable, a lot more open, and we've got some plans to put some time into it, so it's definitely possible to put those values in configuration there rather than in the cron job. So yeah, definitely a good point. Okay, thanks.

A couple of questions at the conceptual level. You're using a username and password to authenticate initially, so if someone hacks the username and password and therefore gets a certificate, what do you do for the next 24 or 48 hours? That's the first question. The second one is: a genuine entity has got a certificate and it's about to expire in, say, five minutes, and then it actually wants to authenticate and do something that's going to last ten minutes, or longer than the life of the certificate.

Okay, so please stay near the mics just in case I need to come back to you. The first question was regarding the username and password used to request the certificate, and what happens if someone compromises that username and password. So first of all, you should be deploying this in production using Keystone or LDAP rather than the static username and password, which gives you some level of protection there. But the idea is that you should tie your validators down quite tightly. The validators are completely user-controllable, so you tailor them to your system, and they should be built so that only a node that meets those criteria can request a certificate, and it will only be able to request a certificate that meets the criteria of the validators. So you could only issue a certificate to Nova rather than Google.com, say.
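As an example of the kind of check being described here, and of the "expert knowledge" validators Tim mentioned earlier, a sketch of a rule that ties a requested FQDN to a naming pattern, to an expected IP block, and to the address the request came from might look like this. The pattern, network range, and exception class are all invented for illustration; this is not one of Anchor's shipped validators.

```python
# Sketch of an "expert knowledge" validator: the requested name must match
# the nova naming convention, resolve into the nova management network,
# and match the address the request actually came from.
import ipaddress
import re
import socket

class ValidationError(Exception):
    pass

NOVA_NAME_RE = re.compile(r"^nova-\d+\.example\.com$")   # invented pattern
NOVA_NET = ipaddress.ip_network("10.1.0.0/24")           # invented range

def nova_node_validator(requested_fqdn, client_ip):
    # Naming convention: only hosts that look like nova nodes are allowed.
    if not NOVA_NAME_RE.match(requested_fqdn):
        raise ValidationError("%s does not look like a nova node" % requested_fqdn)

    # The name must resolve into the expected management network...
    resolved = ipaddress.ip_address(socket.gethostbyname(requested_fqdn))
    if resolved not in NOVA_NET:
        raise ValidationError("%s resolves outside %s" % (requested_fqdn, NOVA_NET))

    # ...and the request should actually be coming from that address.
    if ipaddress.ip_address(client_ip) != resolved:
        raise ValidationError("request for %s came from %s, expected %s"
                              % (requested_fqdn, client_ip, resolved))
```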
Obviously there's the issue that if you can log into the CA, you could compromise certificates, but that's the same as every other CA out there, so this doesn't particularly change that risk profile. I'm sorry, I've forgotten your second question.

I think I can answer that one. You're asking what happens if the certificate expires during the connection, is that right? I don't think that's really a problem, because the only time validation happens is when the connection is initiated, or during a renegotiation of the connection, and typically services don't ever renegotiate, and you'd usually disable renegotiation anyway. So I don't think that's really an issue; even if the certificate expires while the connection is up, it should just be fine. To add to that, we should have mentioned that we have it configured, and we generally recommend it to be configured, so that if your certificate expires in, say, four hours' time, you request a new certificate and roll it into place early, so that you're not running right up to the end of the certificate window and then replacing it.

No, but you said to replace it with LDAP. I mean, LDAP is still username and password, isn't it? Yes, so this is where the validators come in. It should only grant the node that is requesting a certificate a certificate for that node, or similar nodes in that area; you shouldn't be able to request a much broader certificate.

Right, but does it actually check that the request is coming from the node, or is the username and password purely sufficient to prove that it's from that node? It will do things like a reverse lookup on the IP address that the certificate is being issued for and check that against the IP address the request was submitted from. So if you've submitted a request from 1.1.1.1 and you're requesting a certificate for Google.com, it will notice that 1.1.1.1 isn't the address for Google.com, log it, raise it to your ArcSight, and an administrator should investigate.

Right, so you've got some checks in, but ultimately the issue is that if you find out after a certificate has been issued that, for whatever reason, it was issued wrongly, you are effectively stuck for that amount of time, until the 24 or 48 hours run out. Correct, but that's no different to using CRLs, where you're stuck for as long as it takes the CRL to propagate. That's true.

Sort of following on from one of his questions: is there any interest in adding support for different authentication mechanisms, such as GSSAPI, or getting a remote user header from Apache? Absolutely. And the other question is: can you add extra fields to the certificates that you sign? So that answers your first question, and your second question was about adding extra fields onto the certificates that get issued: yes, we accept custom fields and you can assemble those in. Okay, thank you. I should say as well, sorry, that you can add validators quite easily too, so if you want to add custom validators, the idea is that we will acquire a library of them as usage scenarios emerge.

Hi, you mentioned at one point that you had an issue with image downloads intermittently failing. What was the cause of that, did you say? That was to do with a bug in stunnel, in versions up to 5.9; it was prematurely closing the TCP connection before it actually needed to be closed. Thank you. Is that everyone? Great. Thank you.