Hi, welcome, everyone. My name is Stephen Armstrong; I'm a principal automation engineer at Paddy Power Betfair. This is Dave Buckley, an infrastructure automation engineer at Paddy Power Betfair.

First off, a bit about who Paddy Power Betfair are. We've actually only existed as a company for a couple of months: Paddy Power and Betfair merged back in February. Stephen and I come from the Betfair side, so you'll have to forgive us if in this presentation we refer to Betfair rather than the rather wordy Paddy Power Betfair. Betfair is famous for being an online sports betting exchange; it allows punters to bet against other people around the world through an online platform. The figures here, which are just for Betfair, give a sense of the scale of traffic the website receives. We have 1.7 million active users and are getting up towards 150 million daily transactions. A lot of our traffic goes through our API; we're processing about 3.7 billion daily API calls. As you can see at the bottom, we've got details of the merger with Paddy Power in February. We're now a FTSE 100 company, one of the top 100 companies trading on the London Stock Exchange.

A bit about the stack. What we've done is put together a Red Hat OpenStack implementation. At the compute level we have HP boxes, DL360 Gen9s, which we run for all our compute. At the top of rack for networking we use Arista, and then we've got our SDN controller, which is Nuage Networks. In terms of Cinder integration we have Pure Storage; that's our all-flash solution. We also use NetApp for our NFS requirements. For monitoring the stack we use Sensu, which hooks into Ceilometer as well.

A bit about how we're actually running this in production. We have an active-active data centre setup. Coming in at the top we have UltraDNS, which allows us to balance traffic between DCs. Below that you've got our Juniper SRX, our external firewall, and then two layers of Citrix load balancers. Integrating with those, we've got Arista at the top of rack with our Nuage SDN controller. We run two OpenStacks per DC: one for our tooling and monitoring, and then our infrastructure OpenStack, which runs all our production workloads and all our test environments. We chose to split them because, if we had an issue with OpenStack, we wanted our monitoring to be in a separate OpenStack so that we could still use it to monitor accurately. Underneath that you have all the compute, which is the HP stack. We run KVM for everything in terms of hypervisors; we'll use the HP boxes for bare metal later on. We also have Pure Storage and NetApp, which we use for our performance workloads: with Pure Storage we mount Cinder volumes to get higher performance out of the all-flash solution, and later on with NetApp we're looking to introduce Manila for our NFS. We use RDO installations to actually install it, one RDO installation per OpenStack.

What we really wanted to do, and the driver behind all of this, was to set up a self-service model. Each individual team gets their own tenant. We wanted to give them the ability to create their own flavours and size the boxes the way they want: they get to specify the CPU, RAM and disk space for each of the VMs themselves.
We wanted to give them a self-service model, so if they have an application that requires bare metal they can get that (we'll be utilising Ironic for it), and for the virtualised estate we have KVM. The other thing we wanted to do, and this was a new thing for Betfair, was to give teams the ability to set up and self-service their own firewall rules. Rather than waiting weeks for a network team to service that, we've put it in the hands of the developers: they create all their ACL rules using Nuage. The other thing we've got is load balancing. We allow developers to swap the boxes they provision in and out of VIPs and roll releases through on the VIPs.

In terms of our tool chain, this is how it looks. A developer or an infrastructure engineer checks changes into GitLab. That triggers a continuous integration build on a Jenkins slave, which rolls up into a manifest file that's pushed out to Artifactory. ThoughtWorks Go then schedules the deployment; we use Ansible for our orchestration, and then Chef for our configuration management. We'll walk you through that workflow with a demo, so I won't go into too much detail just now.

How do teams actually integrate with the infrastructure? We use Ansible, so it's a self-service, static inventory file. In this example you can see that teams specify that they want five instances; you can see that with the re01 to 05. On that same line item they specify their flavour (the vCPUs, RAM and disk space), and then the image they want to use. For our Linux estate we offer CentOS 6 and CentOS 7 images, and we tag all the boxes with a particular run list. (We'll show a sketch of what such a line item might look like in a moment.) One of the things we wanted to do was give development teams uncontended hypervisors and allow them to chop up the CPU the way they want. The hosts you can see there are split between racks; that's two hypervisors. Teams actually specify which hypervisors those VMs are placed on, and we use Nova scheduler rules to do that.

Now it's demo time. What we're going to do is walk through our workflow. Hopefully the demo gods are kind to us; we've run it 27 times already this morning, so hopefully number 28 is lucky too. Starting off, we have our traditional CI build. In this instance we're running Riemann; that's our TLA, our three-letter acronym, for the application. If we just click on that: what this does is produce the RPM files that we're going to install on the box. If we browse through into Artifactory, this takes us into the Artifactory build browser. Under published modules you can see that it's produced one artifact, a specific RPM for that particular build. If we go through to 'show in tree', you can see the folder structure: all our builds, 20, 34, 35 and 37, are in here. That's how we produce the application RPM; it's just a traditional CI build. What we wanted to do was correlate the Chef recipe that we use to install that RPM, so all we do at this point in the CI build is tag it in GitLab. It's fairly simple: that's how we correlate the RPM with the Chef recipe.
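To make that static inventory file concrete, here's a rough sketch of what a team's line item might look like, expressed as YAML. The field names and values are illustrative assumptions, not Betfair's exact format:

```yaml
# Illustrative only: a self-service line item for five Riemann QA boxes.
riemann_qa:
  count: 5                                # re01 .. re05
  name_prefix: i1reqa-re                  # IE1 data centre, QA environment, Riemann TLA
  flavour: {vcpus: 2, ram_mb: 4096, disk_gb: 20}
  image: centos6                          # exact Glance image version comes from the manifest
  run_list: "role[riemann]"               # tagged onto each VM as metadata
  hypervisors: [rack1-hv01, rack2-hv01]   # split across racks to survive a rack failure
```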
The next thing we wanted to do was treat infrastructure as code, so we have our Linux CI build. What this does is produce a CentOS 6 and a CentOS 7 image; we use Packer for this. Again, using the build browser, if we go to published modules you can see that we create two QCOW2 images, which we then upload to Glance. With this step we've patched everything as part of the CI build, we've security hardened it, and we've vulnerability-scanned it as well, so when it goes out to the delivery teams it's all patched up to the latest level.

The next thing in our flow is the SDN. This was new to Betfair, and we've utilised Nuage for this step. All we're doing is tagging the specific ACL rules at a particular point in time; we're basically just using a GitLab label for that. If we go into the config, we'll show you what that looks like. Under the Nuage repository we have all the different teams and applications. In this instance, if we look under Riemann, which we'll be deploying today, we've got the ACL rules. What you want is a consistent set of ACL rules for every single environment, so that you test the same way in quality assurance environments as you do in production. In this instance it's fairly simple: the developers have just opened up port 80. This is probably a bad example because we wouldn't normally allow that; in reality our rules are a bit more locked down. If we step back, we've got a common set of ACL rules that we apply on top of that. The common set allows access to DNS and so on, so developers don't have to set all of that up themselves; they focus only on the application-specific rules. Nuage has a very good way of doing this using network macros: you can integrate with anything sitting outside the overlay network. For instance, if we've got applications sitting in our legacy estate behind the Citrix NetScalers, we can just connect to them using network macros.

Back in the Riemann SDN config, another important thing we've introduced is micro subnets. If we look at QA: when we spin up boxes we have A/B subnets. The first release goes into the A subnet, the second release goes into the B subnet, and we do a full tear-down of the old network and subnet every time we deploy. That means the ACL rules are always up to date; we'll demo that in a bit. You can see that we've got a /26 micro subnet for each of them. When we onboard teams, they just fill out these YAML files, which are simply var files for Ansible. That's all tagged at a specific version, and that's how the SDN plugs in.
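As a rough sketch, one of those per-application SDN var files might look something like this. The schema is an assumption for illustration; the real files are Betfair-internal:

```yaml
# Illustrative per-application Nuage var file: application ACL rules plus A/B micro subnets.
acl_rules:
  - description: allow inbound HTTP to the Riemann boxes
    direction: ingress
    protocol: tcp
    port: 80
    source: any            # common rules (DNS etc.) are layered on separately
subnets:
  qa:
    a: 10.10.1.0/26        # release N lands here...
    b: 10.10.1.64/26       # ...release N+1 here, then the A subnet is torn down
```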
The second last piece is our load balancer config. Again, it's just tagged at a specific version in GitLab whenever there are changes to that repository. If we go to the load balancer project: for developers, we're trying to make this as consistent as possible. Where you had Nuage before, you've got NetScaler this time. Underneath it you get the particular team, in this instance BFRE. If we go into QA1, you'll see that we specify all the NetScaler-specific load balancing rules. We've got our LB vserver; we've got our service, serving on port 80; then we've got our monitor, which does the health checking of the boxes; and then our service binding. One of the important points is at the bottom, where you've got your roll percentage. Ansible allows you to do rolling updates on the boxes: that will take 50% of the boxes out of service, update them, and put them back on the load balancer.

The final one before we kick off the pipeline is our common workflow CI build. What we wanted was a centralised mechanism so that all teams spin up VMs in the same way and create Layer 3 networks in the same way, because we didn't want them running different snowflake scripts. We provide that set of common workflow actions, and we tag it at a specific version as well.

How do we roll this all up together? Each TLA has its own TLA package build; that's the application package build. If you click on 'Build with parameters', you can see that it's pulling in all the different CI builds at their different versions. The trigger for the Riemann package build would be a developer check-in on the Riemann CI build. That then rolls up and consumes everything else at its particular version: the next time they deploy after a check-in, they'll consume the network config, the load balancing config, and any workflow actions that have been updated. What that actually outputs is a simple JSON file (sketched below). The JSON file just has all the versioning in it: the tags of all the GitLab repositories.

How does that roll up? Once that's triggered, we go through to Artifactory again, via the build browser. If you remember, ThoughtWorks Go polls this release repository, which is where all the manifest files sit; if a brand new manifest file is generated, that triggers the pipeline. If you want to do a rollback, you roll back to the previous manifest file, which pins all the code: the Chef recipe, your SDN config, your load balancing config and your platform template. You're rolling everything back at that point, not just the code, because the code might not be what broke.

What we'll show is kicking off a package build, but before we do that, we'll show you the inventory file and how teams actually integrate with it. A team that wants to deploy their particular TLA would set up their package build and all their CI jobs. We use Jenkins Job Builder for that, so everything you see in Jenkins is in source control too. We're about to show you ThoughtWorks Go; we have a pipeline builder for that, so we build all of it from YAML as well. Before that: if a team wants to integrate with this, they specify a new line item in the self-service inventory file for their particular deployment.
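To give a feel for it, the manifest might look roughly like this. The real file is JSON; this shows the equivalent content as YAML, and the keys and tag values are illustrative assumptions:

```yaml
# Illustrative manifest: every input to the deployment pinned at a specific version.
application_rpm: "37"             # Riemann CI build number
chef_recipe_tag: riemann-37       # GitLab tag correlating the recipe with the RPM
linux_image: centos6-1.0.42       # hardened QCOW2 that passed the image pipeline
sdn_config_tag: riemann-sdn-12    # ACL rules and micro subnets
lb_config_tag: bfre-qa1-8         # NetScaler vserver, service, monitor, roll percentage
workflow_actions_tag: common-2.3  # shared Ansible playbooks
```

Rolling back is then just redeploying the previous manifest.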
In this instance we have Riemann QA, and here we've specified five boxes; can you just highlight that? That's our naming standard: the IE1 data centre, then QA. Then we've got the flavour they specify, their vCPUs, RAM and disk space, and in this instance they want to use the CentOS 6 image. All the information and the version of that CentOS 6 image comes from the manifest file, so they don't need to fill that in; they just consume the latest and greatest. We tag our boxes with the profile we want, so when we spin up the VMs we tag them with metadata that specifies the run list. Then, going along the line, you specify which hypervisors you're going to deploy onto. In production we'd have two hypervisors, so that a rack failure would not take down the whole application; we design the whole data centre for failure.

OK, with that filled in, we're ready to kick off our package build. This would normally trigger automatically; you wouldn't have to do it manually. I'm just going to check I'm signed into Horizon. You should see this trigger in a minute; there's a polling interval on ThoughtWorks Go, and then we should see it kick off. I'll just wait a second, and it's probably going to take longer because... come on, number 28, it'll get there. This happens every time, I swear, every time in front of an audience. It's got a one-minute polling interval, so sometimes you hit it spot on and other times you wait for 60 seconds; one of those times seems to be now. Come on, demo gods. There we go.

The first step of the pipeline downloads the manifest file. What it does is pull down from all the GitLab repositories and assemble the Ansible var file structure; the common workflow actions, which are just Ansible playbooks, then run across that. The second bit, 'set up prerequisites OS', is going to create our flavour and assemble our host aggregate; we dynamically assemble the host aggregates each time, so we'll just wait until it gets to that step. If we go into OpenStack at this point, you'll see that every part of this is completely immutable: we take it down and bring it back up, building everything from scratch each time. If we go to the flavours, you'll see that as part of this we generate the flavour; it's a private flavour for each particular tenant. This is all using Ansible 2.0 modules, and we've written some custom modules too, but we'll take you through that later. We also assemble the host aggregate, and you'll see the particular hypervisors underneath it, corresponding to the inventory file. We use the extra specs filter for this: we tag the host aggregate with metadata (you can see there the TLA tag for Riemann QA in IE1), and the flavour is also tagged with that same metadata. If you spin up using that particular flavour, it will place the VMs on those particular boxes. That's invisible to the end users; all they care about is specifying which hypervisors they want to assign.

We've slowed it down and put a manual step in here, because we always take too long at this step; normally it would just go straight through. But what I wanted to do first was take you through Nuage. In Nuage... and again we've been logged out. It's good that it logs out; shows it's secure. That was one for the security team.
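As a minimal sketch of that flavour and host-aggregate stage: the first task uses Ansible 2.0's upstream os_nova_flavor module, while nova_host_aggregate stands in for one of the custom modules the talk mentions (that module name, the cloud name and the values are assumptions):

```yaml
- hosts: localhost
  connection: local
  tasks:
    - name: Create a private flavour for the tenant
      os_nova_flavor:
        cloud: ie1-infra            # assumed entry in clouds.yaml
        name: riemann-qa-flavour
        vcpus: 2
        ram: 4096                   # MB
        disk: 20                    # GB
        is_public: false

    - name: Assemble and tag the host aggregate for the extra specs filter
      nova_host_aggregate:          # hypothetical custom module
        name: riemann-qa
        hosts: [rack1-hv01, rack2-hv01]
        metadata:
          tla: riemann-qa           # the flavour carries matching extra specs,
                                    # so Nova places these VMs on these hosts
```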
So at the moment you've got your B deployment sitting there. What we're going to do is push this on, and it'll create the network, in this case our A network, because it alternates between the two. If you go into the network and show the policies, this is how Nuage works with the ACL policies: you've got your ingress and egress rules. Each of these corresponds directly to the OpenStack subnet, so it's a one-to-one mapping. You can see the common rules we showed before and, layered on top of those, the application-specific rules. If you click on the pipeline, we should see it pop up in a second, and then we'll actually see the VMs for the new release being placed into the subnet, unless someone's stolen our Go agents. Yep, there's the brand new network popping up. If we went into OpenStack you'd see the corresponding network, but we won't do that just now because we want to see the VMs pop up. That's applied the brand new ACL policies to that particular subnet.

If you click back to the pipeline, it should go on to the next step. Back in OpenStack, we should see the boxes actually pop up now. Remember, in the static inventory file we tag the boxes with the specific profile. There you go, they've all popped up: you've got your five boxes, and you can see the particular naming standard on them as well. The important thing to note is that each box is tagged with the run list (see the sketch below). We then use Ansible dynamic inventory for the Chef run: it uses that tag, pulled from the metadata in OpenStack, to work out which Chef recipe to execute on the box. That's very important, because we can just tag boxes with a profile, run over them, and install the software we want.

Another thing to note, if we go to instances, is how we filter on the boxes. We've also got our Sensu checks: we haven't specified any for this one because it's just a test application, but there's a subscription and checks, so all the monitoring is set up this way as well; we use Ansible to tag everything onto the box. Also important is that the build ID is on there too. That allows us to make decisions later, which matters when you want to roll boxes on and off the load balancer: with Ansible we filter down, saying 'return me all the boxes in this group', then 'just the boxes with this specific build version', and anything not equal to that gets rolled off the load balancer.

Back at the pipeline, that's still running Chef; this is a stage that varies per application. Now we're on to the create step, which goes to the NetScaler and sets up your VIP using those load balancing rules we specified. In this instance we're doing round robin, as the development team specified in their config file.
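The VM-boot step just shown could be sketched with the upstream os_server module (Ansible 2.0). The metadata keys mirror the tags described in the demo, but their exact names here are assumptions:

```yaml
- name: Boot the five Riemann QA boxes into the freshly created A subnet
  os_server:
    cloud: ie1-infra
    name: "i1reqa-re{{ item }}"        # naming standard: DC, environment, TLA
    image: "{{ image_from_manifest }}" # assumed var resolved from the manifest
    flavor: riemann-qa-flavour
    network: riemann-qa-a
    meta:
      run_list: "role[riemann]"        # read back by dynamic inventory for the Chef run
      build_id: "{{ build_id }}"       # used later to roll old versions off the VIP
      sensu_subscriptions: riemann     # monitoring checks keyed off this tag
  with_sequence: start=1 end=5 format=%02d
```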
The rolling update then uses that Ansible dynamic inventory to filter down to the right boxes and roll them into service at the 50% roll rate, rolling the old boxes off the load balancer as it goes. Finally, we clean up the previous version's subnet and VMs. Generally we've got testing in between: there's a test phase where a test pack runs. So that's the rolling update happening, and you can actually hit that particular VIP now; this is how Riemann, the rather ugly test application we've launched, looks. It should be a zero-downtime deployment too: it shouldn't affect live service, because of the rolling update. What we should see is it killing all the VMs with the previous version and then killing the subnet, then awaiting the next deploy. The subnet follows a moment later because we clean up the ACL rules first; that's why there's a slight delay. That's it done, and next it's going to promote onto the next stage of the pipeline.

If we go back to the package build, you can see the promotions. If you hover over it, that's the previous build: it went through QA and then through to our integration environment. In a second, with this promotion, you should see a star pop up, so developers can look at their package build and see what stages it's been through. There it is; it's just picked up the star for the next stage. What that does is move the manifest to another folder location, and the next stage of the pipeline polls that location and kicks off. So that's our green pipeline; the demo gods were good to us. The polling gods won't be, though, because it'll probably take a minute to pick up the next stage, and then it goes all the way through to production. That's how teams deploy. We'll just wait a second to show it going to the next stage, and for the next bit we'll take you through some of the Ansible modules we wrote and some of the contributions made to the OpenStack community through Shade. We'll come off this and come back to it; it should deploy through.

OK, so this is the timing. End to end it takes around seven minutes, most of which is the Chef run; obviously, when you're demoing it, it seems to take a little longer. These are the stages: pull down the manifest and prerequisites; set up the flavour and host aggregate; create the Layer 3 network with your specific ACL rules; launch the VMs into that network; tag the boxes and run Chef using the tagged profile; create the VIP and roll the new boxes on while rolling the previous version off; clean up the previous version's VMs and ACL rules and tear down its network; then wait for the next deploy and promote to the next stage. (The rolling-update stage is sketched below.) Can we flip back now and see if it's gone on? Yep, there you go: it's triggered onto the next stage and it's going through the pipeline.

OK, back to the slides. Dave's going to take you through what we actually did with OpenStack and Ansible modules, and some of the things we've solved using this methodology. Yep; I know we've not got too much time, so I'll try to speed through these. As Steve said, we're using Ansible for all our orchestration through the pipeline.
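The rolling-update stage might be sketched like this. The group name comes from the OpenStack dynamic inventory, serial mirrors the roll percentage in the load balancer config, and netscaler_service stands in for one of the custom NetScaler modules mentioned later (all the names are assumptions):

```yaml
- hosts: riemann-qa                 # group assembled by the dynamic inventory
  serial: "50%"                     # the roll percentage from the LB config
  tasks:
    - name: Take this box out of service on the NetScaler
      netscaler_service:            # hypothetical custom module
        name: "{{ inventory_hostname }}"
        state: disabled
    - name: Converge the box to the new release
      command: chef-client          # simplified; the real run uses the tagged run list
    - name: Put the box back into service
      netscaler_service:
        name: "{{ inventory_hostname }}"
        state: enabled
```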
All the interaction with OpenStack is done through Ansible modules. We like Ansible because it's open source and easy to use. We've got a real mix: for some operations we use the upstream Ansible core modules directly; for some we've found we had to patch modules or create our own custom ones based off the upstream versions; and in some cases we've made our own from scratch using the Python Shade library. In that pipeline, every single stage is running an Ansible playbook, and we use that to create the VMs, create the host aggregates, arrange the hypervisors into the host aggregates, and do the teardown of the VMs as well.

The advantages of this: we're treating infrastructure as code, and everything's source controlled, tagged and versioned, which helps with the rollback Steve described earlier. We have the self-service model, so teams create their own flavours: if they want to increase the number of vCPUs, they just fill in a config file and redeploy their application. There's absolutely no wait time; you're not waiting for a ticket to be completed by the infrastructure team, and you can scale up and down at will just by changing the number of boxes in that config file. We also use the notion of an availability zone in OpenStack to segregate our test environments, so QA is completely separate from our integration environment, which is completely separate from our production environment. And because we're using the same playbooks to deploy in all those environments, by the time you get to production you can be confident there are no problems with your code, since everything's been deployed in the same way.

A bit about the image automation. This is how we have a pipeline to create our Linux and Windows images, which get pushed up to OpenStack Glance (a sketch of the final upload step follows). Again, as Steve said, we're using Packer by HashiCorp, which is really cool. We have the QEMU builder plugin for Packer, which we run on a hypervisor with nested virtualisation enabled; that enables us to build the base images. Then we use the OpenStack builder plugin to spin up instances off those images in OpenStack, run Chef on them to install all the prerequisites we need, and the images generated from that deployment are uploaded to Glance in the full clouds across our two DCs. Again, infrastructure as code; everything is source controlled. This is a really big thing for us, so when you roll back, you're rolling back absolutely everything. It's completely automated: teams just consume the latest version of the image that has successfully passed the pipeline, and there's no in-place patching whatsoever. All the patching is automated in that pipeline, so teams don't have to worry about it either; they consume the image as is. We also have the security hardening in that pipeline. One of the end-user benefits is that teams get a VM in 10 seconds for Linux; for Windows you have to wait a little bit, a couple of minutes. No patching interruptions to applications. It's all about the devs: the devs are our end users, and we're trying to make life easier for them.

For the SDN, it's a similar theme. In this case there were no Ansible modules for Nuage, so one of our guys, Mario Santos, has been an absolute machine and wrote about 43 Ansible modules using the Nuage Python SDK.
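The last step of that image pipeline, pushing a Packer-built QCOW2 up to Glance, could look like this with the upstream os_image module; the file name and versioning scheme are assumptions:

```yaml
- name: Upload the hardened CentOS image to Glance
  os_image:
    cloud: ie1-infra
    name: "centos6-{{ build_number }}"  # versioned so teams consume it via the manifest
    filename: output/centos6.qcow2      # artifact from the Packer QEMU builder
    disk_format: qcow2
    container_format: bare
    is_public: true
```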
Essentially those are wrappers around the SDK, and they carry out all the orchestration in Nuage: creating subnets, creating ACL rules. We also have a 'day one' playbook that effectively builds out Betfair's estate, so if we ever lost a data centre, or moved into a new data centre, we'd just run this playbook, generate all our Nuage domains, and basically redeploy our entire network estate within minutes. The benefits, again: infrastructure as code. Teams can view the ACL rules, which gives us insight into which applications are actually talking to each other; in the past that's been a bit of a black box for the devs at Betfair. You also get easy auditability of those rules. By default we have a deny-all in Nuage, and you open up only the specific ACL rules you need, so everything is blocked except what applications need to talk to one another on the specific ports they need. (A sketch of what one of these module calls might look like follows.)

For the load balancer it's a similar thing: we have network as code. We've got 38 modules, not quite as many as for Nuage, to carry out the creation of all the objects on the load balancer and do the VIP creation and the rolling update in our pipeline. The key thing for Ansible is to make all those modules idempotent, as that ensures that what you have in source control in your config files exactly matches what you have deployed in production and in your testing environments. It's a similar theme again: treat infrastructure as code, version everything, and enable clear visibility into what at first seems like a complex network structure. At Betfair we're now source controlling all of that, giving visibility to the devs and to all the infrastructure guys, so they can understand network changes.

Just taking you through what's next for us. We've basically nailed down our VM workflow, so we're going to be looking at Ironic bare metal provisioning. We're also going to look at offering containers as a service to the devs, and we'll be looking at Kubernetes for that. Because we've been doing this rapidly (we've done all of it in six months, so it's been a fairly rapid ride), we want to take some time and open source everything we've done. We've already done some peer reviews for the Nuage modules, we'll be looking to talk to Citrix about doing the same for the load balancing modules, and everything else we've done for Ansible and Shade we'll be looking to put back to the community. In terms of next steps, we're also going to be looking at Nuage for our third-party VPNs, so we can bring those into the Layer 3 network; we're waiting on the Nuage upgrade for that. Also, as mentioned before, we're using NetApp for NFS, so we're looking at the Manila project for that too.

I just wanted to give a shout out to all these people, because without them this wouldn't have been possible: thanks to our Red Hat guys, our Nuage guys and the guys from Computacenter on the vendor side, and thanks to everyone at Betfair who made it possible. I won't go through all the names because I think we're running out of time. OK, so the next point is questions. As you can see, Dave had some fun with his cowboy hat there in Texas. If anyone's got any questions, take a moment; I'm not sure how much time we've got left, and I can't even see up here either.
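For illustration, a call to one of those custom Nuage modules might take a shape like this; the module name and arguments are invented here, since the modules themselves weren't yet open-sourced at the time of the talk. The important property is idempotency: re-running the play converges to this declared state rather than duplicating rules:

```yaml
- name: Ensure the Riemann ingress HTTP rule exists (idempotent)
  nuage_acl_rule:                  # hypothetical module wrapping the Nuage Python SDK
    domain: riemann-qa
    direction: ingress
    protocol: tcp
    port: 80
    action: allow
    state: present                 # everything else stays blocked by the default deny
```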
For Nuage, what product in particular do you use? The VRS? We use the VSD and VSC. So it's not the VRS? Um, yeah, we do use the VRS as well. Any reason why you don't use the REST API instead of the Python SDK? Everything's written in Ansible, which is Python-based, and the Python SDK is basically just wrapped REST calls anyway. People like Python because it's awesome; OpenStack's Python, right?

You put this process together in six months, so what is your development methodology within the DevOps team? In terms of development methodology, we don't have a DevOps team. Essentially, we try to facilitate the relationship with the devs and make the platform easy to consume; it's a self-service model. One of the drivers for this was to give the devs AWS-like capability, because otherwise they'll just go to public clouds, so it had to be easy for them. One of the things we've done is use the same principles devs normally use to interact with code, so it's quite native to them: filling in config files isn't difficult, and using GitLab is second nature. That was one of the main drivers. In terms of collaboration, we run an onboarding process, and we work in an agile way: recently some of our teams have come in for a two-week sprint to onboard their application. We've got some examples of that too: our exchange mobile site is running on OpenStack, and also our CBR application. We're in the first phase of this, and about 10 different applications are going onto the platform in the next month or so, but we've trialled traffic through it and it's going well. We switched over our exchange mobile site and did a quarter of a million transactions in a two-hour period, and the performance we're getting out of it, through Arista and Nuage with our leaf-spine topology, is pretty phenomenal. I'd recommend it to anyone. Anything else?

What do you do for your network underlay, and how do you provision the switches? We use Arista Zero Touch Provisioning for that, so we've got ZTP for the switches. We also use Red Hat director: we've done some customisations to it to deploy our leaf-spine topology in the network, and we use RDO to scale out all our hypervisors as well. Anything else? Any other questions afterwards, just come and grab us.

What do you use for troubleshooting? What happens when things go wrong while you're provisioning this stuff; what's your recovery model, I guess? What's been very important for us there is Ansible 2.0: we use the block/rescue functionality. Think of it this way: we've already got the clean-down logic, so if we have a failure we just rescue it and tear down that particular environment. We've built that into our common workflow actions to do proper cleanup as well (a sketch follows).

OK, well, thank you very much for allowing us to present today. Thank you.
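As a minimal sketch of that block/rescue recovery pattern (Ansible 2.0+), with illustrative task-file names:

```yaml
- hosts: localhost
  connection: local
  tasks:
    - block:
        - name: Create the network, boot the VMs, run Chef
          include: deploy_tasks.yml      # hypothetical tasks file
      rescue:
        - name: On any failure, tear the half-built environment back down
          include: teardown_tasks.yml    # same cleanup used by the normal A/B rotation
```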