Welcome, first of all, to this talk about the highly available Cloud Foundry deployment that Allstate have put together. I'm excited to be sharing some of the things that we did and how we built this architecture out from the ground up. It was a real challenge for Allstate because it was part of a transformation project as well: moving away from the traditional legacy-style systems that we had, and moving primarily onto Cloud Foundry for a lot of the new services we were looking to build out.

First of all, just a bit about myself. I'm one of the senior platform engineers within Allstate. Our team is basically the engineers that operate and build out the Cloud Foundry platform, and we're constantly working on that on a daily basis, looking to integrate new services into the environment to really enhance what we have there from a user-experience perspective. I'm also on Twitter, so if you have any questions or ever want to follow up on anything, do feel free to reach out to me.

The goal for today is that I want you to leave with a shared understanding of the foundational infrastructure that underlies the Allstate platform, what we put together to design and build this highly available platform, and the operational principles that underpin it: the key aspects we looked at from the start and said, OK, if we're going to build a truly highly available platform, these are the things we need to factor into the design. Towards the end I'll also discuss how we achieve highly available application deployments and point you at more information.

First, some of the concepts that I think are key to what we looked at when we were designing and building this. The first one is the availability zone: an isolated location within one of our data centers. Within each of our data centers we have two availability zones. We then have our regions: our data centers, which are geographically dispersed between the Midwest and the eastern US. And we have our security zones: we've split things out so that if something has to be secure, we have a security zone for it, and if it's something coming in from the internet or the public, we have our DMZ for that. These are really the key building blocks of the platform.

Moving on to some of the limitations and challenges we had initially when we were looking to build this: we couldn't deploy a single Cloud Foundry deployment across the multiple networks, so within each of our availability zones we've had to push out multiple foundations. On VMware, the vSphere that we use required a single management plane, and we also required shared storage across the virtual machines. Those were some of the challenges we had at the start, but we never let them affect where we wanted to get to as the end goal.

So let's look now at the actual architecture of an availability zone. As I said before, we have two availability zones within each of our regions, and we have a two-region build-out.
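Just to make that topology concrete before going further: two regions, two availability zones per region, and two security zones per availability zone works out, as we'll see later in the talk, to eight independent Cloud Foundry foundations. The labels below are purely illustrative rather than our real names, but the shape is as described:

```
Region A, AZ1:  internal (MPN) foundation  +  DMZ foundation
Region A, AZ2:  internal (MPN) foundation  +  DMZ foundation
Region B, AZ1:  internal (MPN) foundation  +  DMZ foundation
Region B, AZ2:  internal (MPN) foundation  +  DMZ foundation

Eight independent foundations in total, with no traffic between them.
```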
Within those availability zones, everything is self-contained. Each availability zone has its own network switches, its own firewalls, its own load balancers and its own shared storage. Within a region, between the availability zones, we have a latency of about two milliseconds; across the regions it's greater than that. But the way we built the highly available platform, and why we used Cloud Foundry as well, was that we looked at it from a region perspective: we didn't want to be constrained to on-premise only, we wanted to be able to look to a future where we could potentially push up into the cloud, using things like AWS, Azure or the Google platform. We're also continuing to integrate with all our legacy services, so although we have our availability zones, we still need to leverage things like Active Directory and some of the shared services that we have within the environment.

So how do we actually go about building this? How do you build an availability zone? One thing, when this was being built, as part of the operational principles: we always heard a lot in the business about wanting this five-nines concept. But when we looked at what we were doing in the legacy world to try to achieve that, we weren't really delivering a highly available platform and solution, and that's where Cloud Foundry has allowed us to do it. It's allowed us to go from service issues that could result in availability problems, and disaster recovery taking maybe 30 minutes to an hour even for some of our critical systems, down to an enterprise that is completely active-active.

Everything has to start somewhere, so we started with the first physical server, and from that we added more servers and created a cluster, which gave us a lot of capacity and a lot of compute. What that looks like within our racks and cabinets is 37 servers, 756 CPUs and over 14,000 gigabytes of memory: pretty powerful for what we're trying to do and achieve, bearing in mind we were just in the initial stages of building the platform out, starting to bring in our digital transformation and bring our services and applications on board. From that we then needed to add our storage, from the storage we connected the switches, and then we virtualized the platform, so anything and everything that we run on our Cloud Foundry deployment runs on VMware.
We then add in a pair of load balancers. As I said before, we have our two security zones, internal and DMZ, so we have pairs of load balancers for each of those in the availability zones. This was to ensure that static routes weren't being added, so that we knew everything was coming directly through the load balancers, and to make sure things were physically separated out, so we were meeting the requirements the business had set from a security standpoint. To secure the environment we then added firewalls, again a pair of firewalls for our internal zone and a pair for our DMZ. Which pair the traffic hits depends on whether it's internal or external, and we'll see further on in the slides what the environment actually came out looking like.

From this we also build our DMZ and public security zone, and our internal network with its restricted and confidential security zones. How that looks is basically one rack initially, connected up to the internet, to our extranet core and to our internal core, so we have our two feeds: our internal clients on the Allstate network, and users coming in from the internet. We then built that out four times and had it all connected directly in. From that we then put Cloud Foundry onto the availability zones. As I said before, within our availability zones we have our Cloud Foundry DMZ and our Cloud Foundry MPN, and each one of those is a separate foundation, so there's no traffic and no communication between those zones. To deploy all of that we use Concourse, and we've also developed an Ops Manager CLI tool which helps us deploy it; we actually open-sourced and released that in the last couple of weeks.

So how does this compare with our data center as it was? Previously, in our data centers, everything would have been very fragmented. This process has allowed us to consolidate the environment, to bring a rack in and roll it straight into the data centers. It removes the potential for failures, or rather it isolates the failures: if there's a failure within one availability zone, traffic is still up and running across everything else, there's no impact to our applications or to their users, and they don't see an AZ going offline as an issue. It also allows us to be a lot more portable, so, as I said before, if we wanted to move this into the cloud we could do that. Looking back at how Allstate would have done things before, we didn't have those isolated failure domains; something going wrong in part of the network would have taken out the majority of the systems and resulted in disaster recovery. We've also made sure that when it comes to things like power, everything is separated out, so we have two feeds with two separate power lines coming in as well.

Looking back at day one and what we wanted to achieve: the first thing was that we wanted to target availability at 99.9 percent, so what we're really saying there is that we're allowing a downtime of eight hours and 45 minutes per year.
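For reference, that downtime budget follows directly from the availability target: 8,760 hours in a year multiplied by (1 − 0.999) gives 8.76 hours, which is roughly eight hours and 45 minutes of allowable downtime per year.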
That was a real change from the traditional model that existed. In the traditional model we used to hear, yes, we want three nines, four nines, five nines, but the data center and the way the infrastructure was built didn't allow us to handle that and give that level of service to what our users and consumers expected. Disaster recovery is automatic; it's derived from the availability of the platform. If one of our regions goes down we still have active-active traffic, we still have everything going across to our second data center, so in the event of a disaster in one there's no need to panic: things are still going to be running. With capacity planning, again we're looking at a true cloud provider and platform experience: we're able to look at the number of applications on the system, aggregate that against the amount of capacity we have, and if we need to build out further we go back to our model and just rack and roll the next level of equipment into the environment. And then maintenance releases. Previously a maintenance release would have involved numerous teams: your application teams, our operations center, maybe your server admins, and whenever a piece of work was being carried out, everything was taken down. Within the environment we have now, whenever we carry out any piece of maintenance, to guarantee the level of availability we roll up through the availability zones themselves: we carry out the work on one availability zone, then we move on to the next, so again we're not impacting the experience the users have, and we're giving them real, true uptime.

This is the overall architecture of what our environment looks like, and as you can see it is segmented. We have our internet user, who comes into a load balancer, and we have our MPN user, and where they go depends on the type of traffic. For example, our MPN user could come in through the top as well as the bottom; that's all dependent on the type of security that's there. If there needs to be a form of authentication for the application, then they can come in through the top-level GSLB. We have ISAM in place, and applications that sit on our DMZ that act like reverse-proxy applications, which route the request down to the actual service that sits within our internal network itself.

Some of the other things to take from this: as well as building this, we also came up with some guidelines around our Cloud Foundry application architecture. What we said to our application teams was, OK, whenever you're doing a deployment, if it's just something internal, what we'd like you to do is deploy a minimum of two instances across each of the availability zones. You may wonder why we did that. One of the things we found at the start, through some of our application teams' testing, was that if they were doing some work and the availability zone was up, but they had taken down their application in one of the availability zones just to see something, they would get sporadic issues, like 404 errors. But with the extra instance there, if one crashed and hadn't restarted, for example, then when our load balancers send that traffic through to CF itself, it doesn't pose any problems.
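As a minimal sketch of that guideline: the foundation API endpoints, org, space and app names below are placeholders rather than our real ones, but an internal app effectively gets pushed with at least two instances to each of the four internal foundations, two regions with two availability zones each:

```sh
# Placeholder foundation endpoints; the real ones are internal to Allstate.
for api in api.sys.regionA-az1.example.internal \
           api.sys.regionA-az2.example.internal \
           api.sys.regionB-az1.example.internal \
           api.sys.regionB-az2.example.internal; do
  cf login -a "https://$api" -o example-org -s example-space
  cf push example-app -i 2   # at least two instances in every availability zone
done
```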
One of the challenges that we still do have, though, and it tends to affect the data services, is that we've had to keep our databases for persistence outside of the Cloud Foundry environment. That's a challenge, but it's something we're actually working on at the minute: we have a development team in Tempe that are looking at how we can integrate that, with Redis and RabbitMQ and some open-source technology, to give good data persistence across our environment. Once that comes in, that again will be a real win for what we're trying to do.

There are some differences when it comes to our internet-facing, authenticated applications. Previously we said, OK, when you're doing a deployment you have to deploy it eight times; well, now you have to deploy it 16 times. Again, the reason behind that is the proxy apps we have sitting within our DMZ. No actual service is running in our DMZ environment; the app there is really just acting as a pass-through to our main service, which sits on our internal network, and for that we're using ISAM. What we said to our teams was, OK, if you require any authentication you have to use this model and method; if you're just doing pass-through and you're letting anybody within the internal organization access the system, then you're good, you don't need to worry about this.

Moving on to some of the challenges we had around highly available application deployments. Within our infrastructure, when application teams come to push out their code, they're having to push it across multiple foundations. What we wanted was zero downtime across those multiple locations using blue-green deployments, we wanted the apps to be consistently deployed across those environments, and we also wanted change-management automation. We didn't want a system where it was going to take an inordinate amount of time to fill out paperwork to say that you're doing a change; we just wanted something where, if a developer needs to push out their code, they can do that quickly and all the relevant change-management processes are worked through.

So how did we achieve that? We had a team in Tempe, Arizona, and they developed a product called Conveyor. The primary goal of it was to deploy applications, to push them across the multiple environments using blue-green deployments, and to fully automate the change records, auto-opening and auto-closing them. But more importantly, and key from a highly available application perspective, it also handles auto-rollback functionality: if something doesn't work in one of our foundations, it makes sure that everything is rolled back to how it was prior to that release. And the great thing about it is that it was the first piece of open-source software that Allstate actually released. It's called Deployadactyl, and I recommend that you go have a look at it; it may be of use to you, and feel free to fork it and make pull requests. We obviously want to give back to the open-source community as well.

So what does this look like? All our users use Jenkins for their pushes, and initially it was just a single curl to the Conveyor API.
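Just to sketch what that trigger looks like from the Jenkins side, with the understanding that the host, path and payload fields here are illustrative placeholders and not Allstate's actual API:

```sh
# Illustrative only: a Jenkins job kicking off a Conveyor deployment with one request.
curl -X POST "https://conveyor.example.internal/deployments" \
     -H "Content-Type: application/json" \
     -d '{
           "application":  "example-app",
           "artifact_url": "https://artifactory.example.internal/libs-release/example-app-1.2.3.jar",
           "environment":  "production"
         }'
```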
That API opened up the change record, went and got the artifacts from Artifactory, then carried out the deployment, pushing it out across each of the zones in turn. It would only ever push to one zone at a time, waiting until that zone had finished. If there was a failure, for example at the first step in our first region and availability zone, it wouldn't try to carry on; it would just roll straight back. Even if it got to the last one and failed in region two, again, it rolls straight back. So, using the blue-green deployments, we're guaranteeing that we have a truly highly available platform.

What are some of the early successes we've had with this setup? We needed to do a retrofit of the power distribution units on the rack-and-roll cabinets, for future maintainability, and we had to make sure we weren't impacting the applications. In data center one we did that live: we relied on intra-AZ resiliency, we did the swap-out of the PDUs, and we had zero downtime and zero impact to any of the applications in the environment, so nobody at that stage knew we were carrying out any work. Data center two required a full power-down, but again we did that over two nights, at six hours apiece, starting in the US and finishing off in Belfast, and again we had no impact to the environment or to our users. So that's the true cloud experience: the applications users needed to access were there, without any impact to them or to the business. The result is that we can complete maintenance with zero impact to applications. And as we use Concourse for deployments and roll through the environments to upgrade the versions of Cloud Foundry that we have, we know that isn't going to impact the users either: the services are going to be available, there'll be zero downtime, and it means we can truly live up to the idea of 99.9% availability, which, as I said before, we weren't able to achieve in the legacy world of our business. We liked to say that we could do it, but it wasn't something we were actually doing.

I'm coming up to the end of the slides here, with some further information about some of the things that Allstate are doing, so if anyone is interested, feel free to have a look at those. There's a good talk by our director of cloud engineering, Matt Curry, and Alan Moran, around how we use Concourse, from SpringOne. We've also got Deployadactyl, which as I say is our first open-source piece of software, and we also have a Ruby gem, the Ops Manager CLI, which again is open source, so feel free to have a look at that as well; that's what we're using for deployments. And one of the things we've seen as part of being part of Allstate and the transformation that's going on, and a good talk I think, is Matt Curry's talk about the culture around technology and the cultural transformation that we have gone through.

So that wraps up everything that I have. Does anybody have any questions at all?

Yes. So what we have is only one deployment within each availability zone, so each one is its own foundation. Any other questions?
We'll be able to scale anything within our local data center, so anything within that region we'll be able to scale up, and it's something we've worked very closely with the network guys on, to make sure we're getting that low level of latency within our region, and then, when we're going out across regions, seeing exactly what that is as well.

Yes, that's all auto-approved, yeah. What actually happens in that process is that the change is opened and has the product manager as the approver, so it's all tracked and traced. What some of the application teams actually do is have a piece within their Jenkins job that adds that approver in, so the product manager approves it and it's all auditable and traceable throughout. Any other questions?

Yes, go ahead. The architecture? Yes, the different foundations in the different availability zones don't talk to each other at all, correct. And you mentioned that we might be moving the state, the databases and such, onto the platform? Yes. But then you'll be forced to... yeah, exactly, so we're going to be forced to open certain things up to actually do that. We have a team in the US at the minute that are looking at that and working on it, and they've actually got it working locally using BOSH, which is pretty much what it will probably end up on: it will just be a BOSH deployment that sits within our availability zones, but not within the Cloud Foundry infrastructure itself.

The data stores that are traditionally used today are Oracle and SQL Server, yeah, so there's async replication, and for SQL Server they use database mirroring, but it all depends upon the application as well. Depending upon the tiering, they may use this nasty term that I don't like, called "remote recoverable", which basically means that if it fails, you're going to have to back up and restore.

Yeah, correct. So the other problem we're having there, that we want to tackle, is that we want to make sure that whenever somebody comes into an availability zone making a request, they'll actually stay inside it, whereas at the minute, whenever somebody comes in, they could be going elsewhere, which is a big challenge from our data store perspective. We have some large systems where requests could be going into region two while the primary database is in region one, and that provides a big challenge because it increases the latency. Yeah, it is, yeah; it's a good problem to have and a good problem to really try and tackle and solve as well.

Any other questions? If not, I'd like to thank you all for your time. I appreciate you coming along, and if you do have any further questions, do feel free to reach out to me; I'm more than happy to answer them for you. Thank you.