So we're gonna go ahead and get started. Welcome to Building a Robust Cloud Foundry. My name is Duncan Winn; I work for Pivotal, on the Cloud Foundry services team. And my name is Hayden Ryan, and I work on the same team as Duncan. Essentially, our team spends about half its time with the engineers, working with them, and the other half working with customers, taking them from, say, a POC and building them straight through to MVP one.

Now, I must apologize if I cough during this presentation. I'm still really quite sick with the flu, so just bear with me.

In this talk we're going to focus on our experience making Cloud Foundry robust, really hardening Cloud Foundry. That means taking the four levels of HA further to make Cloud Foundry itself highly available; looking at how you recover from a disaster situation, so if the worst happens, how do you bring back Cloud Foundry and BOSH; and also security, so how do you lock Cloud Foundry down to make it more secure for your end users?

So we're going to jump straight into high availability, HA. When most developers think about high availability, they're really thinking about their apps and their services: how you keep those services running in a performant, reliable, recoverable way, so they're a joy to use for the end user. They don't go down, they just work, and if something does go wrong, there's timely error detection.

But Cloud Foundry does this for you, especially around your applications. It's got four levels of HA built in already. Start with the application instances: between the health manager and the cloud controller, Cloud Foundry compares the actual state against the desired state, and if an application instance goes away, it brings it back for you. If a platform process dies, the platform processes are monitored by monit, and monit will restart, or try to restart, that process for you. If something more catastrophic goes wrong and a VM goes away, then one of the many actions BOSH can take is to resurrect that VM and bring it back for you. And finally, there's the concept of availability zones, where you can stripe your DEAs across different AZs, so if you lose an AZ, or if you lose a disk store, you still have your applications running.
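To make those four levels concrete, here's a minimal sketch, assuming the v1-era cf and BOSH CLIs this generation of Cloud Foundry used; the app, job, and deployment names are hypothetical:

```bash
# A minimal sketch; app/job/deployment names are hypothetical.

# Level 1: run several app instances; the health manager restores any that die.
cf scale myapp -i 4

# Level 2: on any job VM, monit supervises the platform processes.
bosh ssh router_z1 0            # then run `monit summary` on the VM

# Level 3: make sure the resurrector will recreate this VM if it disappears.
bosh vm resurrection dea_z1 0 on

# Level 4: confirm jobs are striped across both availability zones.
bosh vms my-cf-deployment
```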
So what's left for us to do? Well, really, we're focusing on this concept of availability zones and extending it beyond just DEAs to the Cloud Foundry components themselves, so that we can keep Cloud Foundry running in a performant, highly available way. This is for companies who really need it, whose needs go beyond just a DR scenario: they need to recover faster than they can from a disaster-recovery procedure.

We see two patterns that have come up again and again. The one on your left here is two Cloud Foundry deployments in two different data centers. That could be one Cloud Foundry on Amazon and one on vCenter; it could be active-passive or active-active, and you can then load balance between the two environments. The one on the right is a single Cloud Foundry deployment split across two different AZs. We're going to look at the trade-offs between the two.

We'll start with the one on the left. This means you have to take your application and deploy it into two different locations. It's simple to deploy, you're just deploying Cloud Foundry twice, but operating it is more complex, and you can deal with some of that complexity through a CI pipeline.

So how does it work in terms of traffic flow? The end user is agnostic to the underlying environment. They don't care which data center they hit; they just target myapp.mycloudfoundry.com. They come into a GTM, a global traffic manager. For most companies we've worked with, that's something like an F5 load balancer that consults a DNS resolution service. Through smarts like geo-based IP routing and looking at various load metrics, it works out which data center and which Cloud Foundry to route the traffic to. The request then comes into a specific data center and hits an LTM, a local traffic manager; again, for most companies that's something like an F5 appliance. That appliance should have a VIP, a virtual IP, so if the appliance fails, there's another appliance on standby to grab the VIP and take over, which makes for a very fast failover.

Now, in this scenario we had a problem: the physical appliance sat on the corporate network and dealt with traffic from a number of other departments, and they didn't want traffic to be decrypted, SSL-terminated, at that layer and then passed unencrypted to the NSX boundary. They were using NSX for their software-defined networking, with subnets behind it. So what they did is re-encrypt the traffic, pass it behind the NSX firewall, and then decrypt it at an HAProxy layer. That meant we had three layers of load balancers, which isn't great but doesn't really matter too much, and you could easily take out the third load balancer by putting a physical appliance in that NSX layer.

One other point to note is that we kept the certs for both data centers at the LTM layer. That way, if something went wrong and you couldn't hit one Cloud Foundry instance, you could still route traffic over to the other data center.

A couple of other considerations. We always advocate having two different domains, one for system and one for applications, because you don't want your developers registering apps on hostnames like uaa and the cloud controller's api; you want to keep that separation of concerns.
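As a rough illustration of that split, here's a hedged sketch, with hypothetical domain names, of registering separate application domains alongside the system domain, which is set in the deployment manifest:

```bash
# Hedged sketch; domain names are hypothetical. Platform routes (uaa.*, api.*)
# stay on the system domain set in the manifest; developers push apps onto
# separate shared app domains created with the cf CLI (admin-only commands):
cf create-shared-domain apps-dc1.mycloudfoundry.com   # DC-specific app domain
cf create-shared-domain mycloudfoundry.com            # generic, DC-agnostic domain
```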
Because we had two Cloud Foundry installations, we also wanted to give developers the ability to target a specific foundation, so we ended up with four domains: two system and two application, one pair for each data center. But as I mentioned, the end user needs to be agnostic, so you also end up with the generic myapp.mycloudfoundry.com.

The last key consideration with this pattern is services, and this bites a lot of people: how do you keep your data concurrent across two data centers? Ideally you need something like a stretched layer-2 network with minimal latency, something like sub five milliseconds. With the last customer we worked with, we analyzed their data usage. They were using an Oracle database, but they had an application with a long-running session and really just needed to cache data for that session; it's about a 45-minute session for a call-center app. So we looked at using Redis as a cache, and that was really appropriate, but this first solution, with one app targeting two Redis clusters to keep the data concurrent, is really ugly for the developer.

So next we explored the opposite: instead of putting the onus on the developer to keep data concurrent in two data centers, why not have a service, and have the service manage that concurrency and write to two data sources? That's better, because the developer doesn't need to deal with that complexity, but you still have the latency of going across two data centers, and it's still not great.

So my preferred solution is to use a technology built for this, something like GemFire or Cassandra, which can propagate data synchronously or asynchronously over a WAN, and which has collision-detection algorithms and all of that good stuff. These technologies are really designed to keep data concurrent and consistent between two data centers.
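Whichever backing technology handles the cross-DC replication, the nice thing is that the developer workflow stays the same. A hedged sketch, with a hypothetical service name, plan, and app name:

```bash
# Hedged sketch; service, plan, and app names are hypothetical.
cf create-service redis dedicated session-cache   # the broker hides the replication
cf bind-service callcenter-app session-cache      # credentials land in VCAP_SERVICES
cf restage callcenter-app                         # pick up the new binding
```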
So the second deployment we're going to look at is a single foundation of Cloud Foundry split across two availability zones. Now, as Duncan has already mentioned, quite a lot of the time what we end up deploying really depends on the customer. Companies have very unique environments and very unique requirements, and one of the fantastic things about Cloud Foundry is that it's very flexible in how you can deploy it.

In my case study we have a few examples of the unique requirements companies may have. With this particular customer, we had multiple deployments of Cloud Foundry in a single VPC. We had very restricted IP ranges, so we had to override all the subnets, the IPs, the CIDRs, and so on. And we had some routing requirements as well: we had to use their internal corporate DNS, we weren't allowed to use Route 53 (this was on Amazon), and we also weren't allowed to use Elastic Load Balancers. What this ended up meaning is that the way we deployed Cloud Foundry was a little different from how we'd normally recommend deploying it in an AWS VPC.

So this is what we ended up deploying. Their DNS was a BIND DNS in an external data center, so it was not actually in Amazon itself. We used Direct Connect to get into the VPC, where the customer had a customer-managed bastion box. They also had a customer-managed NAT box, which took managing things like security around those out of our hands. And then we have our two availability zones that we split Cloud Foundry across.

Now, I'm not going to go into too much detail on all the jobs in this particular slide, but if you can't see the slide that well, you can download the slides at the end. Abstracting it up a level, this is what we ended up doing: SSL termination at the HAProxy level, using BIND DNS round robin; that was imposed on us by the customer. It meant that if an availability zone did go down, there'd have to be a manual step to de-register the IPs of the HAProxies for that availability zone. That did add almost a single point of failure, but it was mitigated by a failure matrix which we provided to the customer. Now, the reason we have two HAProxies in each availability zone, and two Gorouters backing them, was so that if an availability zone did go down, the instances we already had deployed would be performant enough to handle the increased traffic load. Similarly, we also scaled the DEAs.

So let's look a little further into this particular customer's deployment of Cloud Foundry. What we started with was: who does Cloud Foundry need to be highly available for? There are really three classes of users. There are your end users, who are consuming your applications. There are your developers, who are developing applications, reading logs, and pushing apps. And there are your operators, who are operating the Cloud Foundry environment and maintaining it, keeping it highly available, providing upgrades, and so on.

With this particular customer, the main priority that required one hundred percent uptime was their end users. Straight away, that makes all of the Cloud Foundry components related to getting applications running, keeping them running, and routing data through extremely critical, so we needed to make sure all of those were deployed in a highly available manner, with at least one instance in each availability zone.

Some of the other jobs we defined as not so critical. They're still very important, obviously, but if there was a little downtime on these jobs, the effects would not necessarily take down Cloud Foundry or the users' apps. These were things like logging, and the health monitor: if the health monitor goes away, it just means it's not monitoring your apps, and a short period without that monitoring might be OK, depending on how many levels of HA you've still got running. The same goes for BOSH: if BOSH goes away, you don't have a resurrector, you can't resurrect VMs, and you can't administer the Cloud Foundry environment. But is that critical to the operation of Cloud Foundry? It really depends on your use cases; in this case it just removes one or two of the levels of high availability that Duncan spoke about earlier.

And there were some things that were not critical at all. The clock_global job is essentially a cleanup job; it periodically tells the cloud controller to clean itself up. This was not critical, so we decided it could have a single instance in one availability zone; we didn't need to replicate it. Similarly the jump box: the customer had already set up a bastion box, and we decided that spinning up a new jump box wouldn't take that long in a downtime situation; it would just add a little to the time to recovery. So we ended up deploying a minimum of one instance of each of the critical jobs in each availability zone for this customer, striped roughly as sketched below.
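As referenced above, here's a hedged, heavily trimmed sketch of how that striping looked in a v1-era BOSH manifest: one resource pool per AZ, and each critical job deployed once per pool. All names, counts, and zones here are hypothetical, and real jobs would also carry templates and networks:

```bash
# Hedged, trimmed sketch of a v1-era manifest excerpt; names/zones hypothetical.
cat <<'EOF' > cf-az-striping-excerpt.yml
resource_pools:
- name: medium_z1
  cloud_properties: {availability_zone: us-east-1a}
- name: medium_z2
  cloud_properties: {availability_zone: us-east-1b}

jobs:
- {name: ha_proxy_z1,  instances: 2, resource_pool: medium_z1}
- {name: ha_proxy_z2,  instances: 2, resource_pool: medium_z2}
- {name: router_z1,    instances: 2, resource_pool: medium_z1}
- {name: router_z2,    instances: 2, resource_pool: medium_z2}
- {name: dea_z1,       instances: 4, resource_pool: medium_z1}
- {name: dea_z2,       instances: 4, resource_pool: medium_z2}
- {name: clock_global, instances: 1, resource_pool: medium_z1}  # not replicated
EOF
```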
So, choosing a deployment topology really comes down to a lot of different factors. There's no one-size-fits-all; you need to be very cognizant of the trade-offs between each of these topologies. For a lot of companies, a single deployment in a single data center that's highly available within that data center, plus a really good disaster-recovery procedure, is enough to push the company forward. Other companies will require higher levels of HA. You could use a dual deployment, so two single deployments across different AZs, or you could have them in different regions. The issue with that is it's quite easy to deploy but quite hard to administer and develop for; it's also harder to push apps to both of them, though that can be mitigated by a CI/CD pipeline. And the final topology is quite complex to design, because you have to be very cognizant of all the individual components of Cloud Foundry, but once it's deployed it's quite straightforward: it's a single deployment, so it's easier to deal with.

OK, on to disaster recovery. How should you back up BOSH, and how should you back up Cloud Foundry? It really comes down to these core components. You need to back up your NFS server or your blobstore, so that your compiled packages and your artifacts are safe. You need to back up your configuration, and preferably source-control it as well: your BOSH manifests and any vendor configuration, so for example in our case, Pivotal Cloud Foundry, back up Ops Manager. And you need to back up your Cloud Foundry databases and your BOSH databases. When you look at this list, it's effectively your Cloud Foundry in its rawest form. Everything else around it is just wiring, just processes, and you can bring all of that back; this is the stuff that needs to stay around. (There's a minimal sketch of this below.)

So we're going to explore a couple of scenarios here, and I'm going to take you through how we do it with Pivotal Cloud Foundry, especially with Ops Manager, and also how you'd do it with just open-source Cloud Foundry and BOSH. The first scenario is: what happens if you lose Ops Manager? We use Ops Manager to deploy Cloud Foundry, so for us Ops Manager is really critical. And what happens if you lose your Cloud Foundry deployment; how do you get that back? For Ops Manager it's really simple: you can export its configuration, and then, if everything goes up in flames, you bring up a new Ops Manager and import that configuration, and providing you're using something like an external database and an external blobstore for Cloud Foundry, everything's fine.
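For the open-source case, the minimal sketch mentioned above might look something like this; hosts, credentials, and paths are hypothetical, and it assumes an internal Postgres database and an NFS blobstore:

```bash
# Hedged sketch; hosts, paths, and credentials are hypothetical.

# 1. Configuration: pull the deployment manifest and keep it in source control.
bosh download manifest cf-deployment cf.yml

# 2. Databases: dump the Cloud Foundry and BOSH databases.
#    (Keep the cloud controller's db_encryption_key with the dump; the data
#    is unusable without it, as covered later.)
pg_dump -h 10.0.16.101 -U ccadmin ccdb  > ccdb.sql
pg_dump -h 10.0.16.10  -U bosh    bosh  > bosh.sql

# 3. Blobstore: archive the NFS store (compiled packages, droplets, artifacts).
tar czf blobstore-$(date +%F).tgz /var/vcap/store/shared
```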
To export Ops Manager, you go into the UI and just download the installation settings; it's really straightforward. If you're really paranoid, you can also copy the deployment manifests, and with those manifests, BOSH can bring back your deployment if it goes away, though Ops Manager will do that for you as well. If you're not using Ops Manager, then you do just that: make sure you keep a copy of those deployment manifests. You can download them from BOSH, and if your deployment goes away, you can use those manifests to bring it back, and everything's good.

So BOSH is critical for bringing back your deployments, and the BOSH director is critical for that. What happens if you lose that BOSH director, if someone goes and deletes that VM? If it's a single VM, what we call MicroBOSH, or something else happens to it, how do you recover from that scenario? We start with the same approach: you back up your configuration. When you deploy the BOSH director, you have its manifest, bosh.yml, and ideally you need that manifest to bring back BOSH. But herein lies a perceived problem: when we go to the directory where that manifest should exist, it's been deleted. Ops Manager does this with good reason; that manifest contains your AWS secret keys and other sensitive information, and we don't want to leave it lying around in plain text on the file system, so Ops Manager deletes it. So you don't have it, and when everything else goes up in flames, how do you recover? You bring back Ops Manager in the same way as previously, and then Ops Manager, and this is its secret sauce, has the capability to reconstruct that BOSH director with the same IP and the same set of credentials. So Ops Manager will actually bring that BOSH director back for you.

How do you do this if you're not using Pivotal Cloud Foundry and Ops Manager? I would argue you absolutely need to keep that BOSH manifest around: it's got all your credentials and all the information about your BOSH director, so ideally you source-control it and keep it. Then, with that YAML file, with that manifest, you can bring back your BOSH director.
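With the v1-era tooling, recovering a MicroBOSH director from a kept manifest looked roughly like this; file names and the director IP are hypothetical:

```bash
# Hedged sketch, assuming the v1-era `bosh micro` CLI plugin; names hypothetical.
bosh micro deployment micro_bosh.yml    # point the CLI at the kept manifest
bosh micro deploy bosh-stemcell.tgz     # recreate the director VM on the same IP
bosh target https://10.0.16.10:25555    # then re-target the recovered director
```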
But you need more than that, because BOSH needs to be aware of all the deployments you've already got out there. So you have a couple of options here. If you go onto the BOSH VM and use monit stop to stop all the processes, you can back up the BOSH database, if you're using the internal database; then, when you bring up a new BOSH, you do the same and import that data into the new database. If you need more than that and you're using the internal blobstore, the NFS store, then you can snapshot the disk: again, go into BOSH, stop all the processes, detach the new disk, and reattach the snapshot. My preferred solution is to use an external BOSH database and an external blobstore; then, when you bring back BOSH on the same IP, you just connect it up to your data stores, your database and your blobstore, and away you go.

When you're backing up the Cloud Foundry databases, you need the DB encryption key for the cloud controller, because the database is encrypted; in order to back up and restore that database, you need that encryption key. In addition, with the blobstore, you need to think about how you back that up as well. In this case, using S3, we set a policy to deny bucket deletion, which meant that bucket was never going to go away and the URL was never going to change. Then you can turn on versioning for the contents of that bucket, and if someone goes in and maliciously or accidentally deletes content, you can restore that content in your blobstore.
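Here's a hedged sketch of that bucket protection with the AWS CLI; the bucket name is hypothetical:

```bash
# Hedged sketch; bucket name is hypothetical.

# Keep old object versions so accidental or malicious deletes are recoverable.
aws s3api put-bucket-versioning --bucket cf-blobstore \
  --versioning-configuration Status=Enabled

# Deny deletion of the bucket itself and its objects, so its URL never changes.
aws s3api put-bucket-policy --bucket cf-blobstore --policy '{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Deny",
    "Principal": "*",
    "Action": ["s3:DeleteBucket", "s3:DeleteObject"],
    "Resource": ["arn:aws:s3:::cf-blobstore", "arn:aws:s3:::cf-blobstore/*"]
  }]
}'
```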
So, wrapping up, the last topic is security. Quite a lot of the time, companies think of security as an afterthought. Security should not be an afterthought; it's actually really important. What I want to do is give you a bit of a primer, some basic things that we've seen out with customers, potentially good things they've done, and share them so you can all get a start on Cloud Foundry security.

Security is a hard problem. It's not a problem that is solved once; it's a process that needs to be continually solved, continually updated, and continually managed, so iterating through security and security issues is really important. Similarly, feedback from any issues that have arisen, say security incident reports, and obtaining management support to hold meetings to discuss security, is fundamental. You need that organizational backing before you can even start with security, really.

In security there are three main concepts. The concept of restriction: using access, authentication, and authorization to restrict access to your VMs and your jobs. Limiting: so, limiting the scope if there has been a compromise of a VM or a job. And mitigating any security breaches that do, or potentially could, occur. So let's dig down into them.

The first step is to restrict users. This is mostly about restricting users accessing your IaaS, your Cloud Foundry installation, and things like BOSH. The number-one step we recommend is to use multi-factor authentication wherever possible. Each user should have an individual account, something that can be attributed to them and put into an audit trail, to find any issues or unwanted changes that have occurred, and also so you can lock that user out if their details have been compromised, or if they're no longer with the company, or disgruntled, or whatever. So: multi-factor authentication at a minimum at the IaaS level. Similarly, you can put MFA onto jump boxes, so when you log in using SSH, you use an RSA key to provide a token; it takes it from just something they know to something they have as well as something they know. In addition to that, we also recommend that all BOSH users have their own separate accounts; again, this creates the ability to individually identify and target any compromised usernames and passwords that come up.

And finally, GitHub. You'll hear me rabbit on a lot about audit trails. When you check your manifests into GitHub, it's really important that you don't just use one username, or say a group username; each person should have their own individual account and be very disciplined about pushing their changes under it. One of the issues we found at one customer site was that they used a jump box with a single username and password, and because of that, whoever logged into GitHub was getting the credit for all the changes being made. Unfortunately, that's not good in terms of having an audit trail.

The second step is to restrict packets. This is best done at the IaaS level. I've focused mostly on Amazon security and Amazon deployments, so a lot of my slides are going to be targeted at that. On Amazon we have security groups: these are security policies defined at the instance level, so they can span subnets, they can span availability zones, and they define ingress and egress rules. Similarly, you can use access control lists, which sit at the subnet level; we find most customers don't actually use these, but the option is there. And finally, routes restrict where data can go: data from an ELB may come in and only be able to go to, say, the HAProxies or the routers.

Cloud Foundry itself has security built into it: you can set network properties to allow and deny IP ranges and CIDRs. It's important to note, though, that application security groups actually override this. So if you're creating a new Cloud Foundry deployment, you're best off locking down basically everything with the deny networks, except of course the VMs for Cloud Foundry themselves, and then progressively allowing access with application security groups through the cf CLI.
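A hedged sketch of that progressive allow step; the destination range, port, and org/space names are hypothetical:

```bash
# Hedged sketch; destination range, port, and org/space names are hypothetical.
cat <<'EOF' > mysql-asg.json
[{"protocol": "tcp", "destination": "10.0.20.0/24", "ports": "3306"}]
EOF
cf create-security-group mysql-access mysql-asg.json   # define the ASG (admin)
cf bind-security-group mysql-access my-org my-space    # open it for one space's apps
```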
OK, so what I wanted to do is diagram out the security architecture of Pivotal Cloud Foundry 1.4, at a basic level, the way our engineers have said we should do it. You'll notice straight away that we have a single VPC, and this is all in one availability zone, by the way. We have a public and a private section: a demilitarized zone, which is accessible from the internet, and then our private subnet, which runs all of our instances and all of our databases.

Now, one of the fundamental things when setting up security rules is to have separation of concerns. You'll notice that all of the instance types have separate security groups, those red lines around them. Ops Manager is a separate concern, so it's got a separate security group; the NAT box has its own security group, as does the ELB, and so do the Cloud Foundry VMs in the Elastic Runtime.

In terms of what you actually allow into this environment, it's quite good practice to restrict access to Ops Manager right down: by default, have its security group allow no ingress whatsoever, and force your users to actually go to AWS and enable access for themselves. This also gives you an audit trail, using things like CloudTrail, which we'll talk about a little later. Now, by default, all of these security groups allow ingress from anything in the VPC. That's the default state; you can lock it down further. For instance, a couple of things you might look at doing: have the Elastic Runtime only allow traffic from the ELB security group on, say, port 443 or port 80, and similarly only allow traffic from the Ops Manager security group on the BOSH CLI ports, of which there are quite a few.

If an attack has been successful, the next thing you need to focus on is limiting its scope. Let's say a runner has been compromised and an attacker has a script on there; we want to limit their ability to jump onto other Cloud Foundry VMs and access the data behind them. One of the most basic things, which we see customers don't always want to do, is having different usernames and passwords for the different jobs in Cloud Foundry. We recommend that both the usernames and the passwords be random strings of characters, say about 20 characters long, using both upper and lower case, and that you avoid YAML special characters. Cloud Foundry is deployed using YAML, and you don't want BOSH to get confused by special characters that it interprets differently; I personally prefer not to use any special characters whatsoever, just upper and lower case. So what does that really mean? Avoid using c1oudc0w for everything.

Let's say there has been a security breach. It's important to understand what has happened and how it has happened, so standard post-breach security rules apply: you should isolate any VMs that have been infected or compromised. In AWS, the way you do that is to attach a special security group to that VM, one that only allows ingress from you and allows no egress out. The cool thing about BOSH is that BOSH won't be able to see that VM anymore, so it'll resurrect a new one for you, which is pretty cool, and most importantly, the new one won't be compromised, because it's resurrected from scratch.
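A hedged sketch of that quarantine move with the AWS CLI; all IDs and the investigator's IP are hypothetical:

```bash
# Hedged sketch; VPC/instance IDs and the investigator IP are hypothetical.
SG_ID=$(aws ec2 create-security-group --group-name quarantine \
  --description "forensics only" --vpc-id vpc-12345678 \
  --query GroupId --output text)

# Allow SSH in from the investigator only...
aws ec2 authorize-security-group-ingress --group-id "$SG_ID" \
  --protocol tcp --port 22 --cidr 203.0.113.10/32

# ...and strip the default allow-all egress rule.
aws ec2 revoke-security-group-egress --group-id "$SG_ID" \
  --ip-permissions '[{"IpProtocol":"-1","IpRanges":[{"CidrIp":"0.0.0.0/0"}]}]'

# Swap the compromised VM onto the quarantine group; BOSH loses sight of it
# and resurrects a clean replacement.
aws ec2 modify-instance-attribute --instance-id i-0abc1234 --groups "$SG_ID"
```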
Then you need to investigate how that particular box was compromised. I would recommend rolling basically everything in your deployment: new passwords, new usernames, new PEM keys, and, most importantly, your IaaS credentials. You don't want the attacker to be able to spin up new boxes in your AWS installation.

And the final concept is to have a good feedback loop: incident reports and so on, and management visibility, to provide better and consistently improving security. We gave this talk internally, and one of the questions that came up was: how do you deal with the operator who gets very disgruntled and wants to take your environment down completely as they leave? One of the best things you can do is really restrict individual IaaS users' ability to delete. We want to avoid them deleting S3 buckets; we don't want them to delete subnets or VPCs, because the subnet ID is actually used in the manifest, so depending on how you've backed things up, it could be problematic bringing the environment back; and, of course, you don't want them deleting your backups. Everything else can be recovered; that's the important point. The other thing is that AWS provides some support for advanced monitoring of users: through CloudTrail it provides things like alerts, so that if, say, a particular security group has had things changed in it, you can actually alert a wider group. This becomes really useful, because you can build audit logs, you can build safety in, and you also have the ability to roll back changes that have been made.

So, we're at the end of our session; just a couple of quick takeaways. When you're looking at deploying and architecting Cloud Foundry, especially in a highly available scenario, you really need to understand the trade-offs and the environmental constraints. There's no one-size-fits-all; it comes down to what you're actually trying to achieve and what level of HA and what level of DR you need. Specifically at the service layer, you need to be cognizant of the impact of dual data centers and any impact it might have: what type of data you have, how you want to persist it, how you want to replicate it, how you keep it concurrent, all of those concerns. You have to be aware of any corporate security concerns and networking constraints, because they can also shape and affect how you deploy and tune Cloud Foundry. When you look at backup, you need to back up your configuration, your databases, and your blobstore, because this is your Cloud Foundry. And finally, you need to think about the usage of Cloud Foundry, how things get used, the ingress and egress of traffic, and locking Cloud Foundry down to make it as secure as possible.

And with that, thank you very much. Thank you.