Hi, and welcome. Today we're going to be talking about continuous delivery of stateful applications using Cinder. But before that, a little bit about our company. We formed via a merger in 2016, and we have offices all over the world: in the UK, Romania, Portugal, Ireland, Malta, Gibraltar, the USA, and Australia. We have an engineering blog, if you want to check it out, where we talk about the various tech we're using and post articles on a weekly basis. We have over a thousand engineers at the company, and what really makes us different is the volume of transactions we handle. We do 135 million daily transactions on the platform and around 30 billion API calls a day. Those figures are taken from our last Grand National, which is the equivalent of our Black Friday, where we do around three and a half times our normal traffic. We also log a lot of data: around 2.5 terabytes a day. And in terms of OpenStack, we are now building a 100K-core OpenStack implementation with around two petabytes of storage.

In terms of continuous delivery challenges, you obviously have two different profiles of application: immutable stateless applications and stateful applications. Some of the applications we're deploying at the moment are the Informix databases for our sportsbook, MySQL databases that we deploy immutably, and things like Kafka, where you need to copy state between the topics. So we have some specific requirements for continuous delivery of stateful applications. We provision all-flash storage with Pure Storage and hook that into Cinder, which allows us to programmatically control volumes. We need to give our development teams the ability to extend volumes if they require, all through the OpenStack APIs. We also need to replicate data between volumes, because when we deploy, we have our A and B deployments: essentially, we'll bring up cluster A, and then we'll need to federate the state across to the new release on cluster B. We'll take you through that later on.

What we use for this is Ansible, and we had to make quite a few modifications to the out-of-the-box Ansible modules, because we're using an earlier version of the shade library. But generally, we just used the vanilla modules that were available in the Ansible 2.0 release. The shade library is very important for us, because it gives us backward compatibility. We've also created some custom modules that we've contributed back upstream to the community. One of them is os_volume_snapshot, which one of our guys in Porto, Mario Santos, wrote and committed back. We also have os_volume_rename, because we needed to rename volumes: when you're dealing with two clusters, you want to persist the state and then attach it, and we'll take you through that workflow. We also had to write some Ansible filters, because we wanted to tag, via metadata, the volumes we want to attach to each VM.

We've tried to keep it simple. In the static Ansible inventory file we have, for example, volume_1, with the following attributes: the name of the volume, prepended with the host name; the volume size in gigabytes; the volume type, i.e. where the volume should be created; and finally, the mount point, as mounted on the VM, and the file system type we want for this volume.
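As a rough illustration, an inventory entry might look like the following. This is a minimal sketch only: the exact variable and attribute names (volume_1, name, size_gb, type, mount_point, fs_type) are hypothetical reconstructions of the attributes just described, not the actual inventory from the talk.

```yaml
# Hypothetical static-inventory variable -- attribute names are illustrative.
volume_1:
  name: "{{ inventory_hostname }}-data"   # volume name, prepended with the host name
  size_gb: 100                            # volume size in gigabytes
  type: pure-flasharray                   # volume type, i.e. where the volume is created
  mount_point: /data                      # mount point, as mounted on the VM
  fs_type: ext4                           # file system type for this volume
```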
So after we spin up a VM, we can see that same metadata tagged on the volume, and we'll also look at how we actually consume it. We wrote a simple filter; this is the implementation of it. I won't go through the code, because it's not relevant, but what it does is parse any key that starts with volume_ and give you back a dictionary. With Ansible, you just call the get_volumes filter, pass in the whole metadata, and it returns a data structure like the one shown on the screen.

Okay. So, bringing that all together, we wanted to implement this in our self-service deployment pipelines. We work on the principle of twelve-factor applications, where essentially your operating system is kept immutable: every time you deploy, it's blown away and torn down. What we do then is treat data as an attachment, and that's what we use Cinder for. We attach the state to each of the virtual machines using Cinder. We're also looking at using it for bare metal later on, and at making sure those shares are mounted and backed up, obviously, because you don't want to lose your data.

This is how it plugs into the self-service framework. As was covered earlier, we have our VM naming standards here, which are the names of our applications. We tag each of the flavors with a particular metadata tag and use the Nova AggregateInstanceExtraSpecsFilter, meaning that when you spin up with that particular flavor, it will land on the particular hypervisors specified there. This gives us redundancy per DC, because each application is split across multiple hypervisors; if you lost a hypervisor, you'd only lose a percentage of your application. Moving on, we have the Ansible role to install the application, and if teams require persistent volumes, they specify the volume here. If they had multiple volumes, it would be volume 1, 2, 3, 4, et cetera. If you wanted different volumes to be specified, rather than having the same volume attached to each of those virtual machines, you would just insert a new line item: instead of having it as a common variable, you would move it into the line item, and then you could have a unique one for each of them. In that scenario, each of those virtual machines has a one-to-one mapping with the unique volume specified there.

So how does this look in terms of the self-service stateful workflow? Starting off, we pull down all of our Ansible repositories, which is essentially all the playbooks. The second step creates a unique flavor. It will then assemble the host aggregate based on the inventory file that's been specified, and tag that host aggregate with the metadata corresponding to the flavor, so that when you spin up virtual machines with that particular flavor, they land on those particular hosts. The next step is to check capacity: we do a capacity check against the hypervisors to make sure there's enough capacity for the deployment to succeed, and we also check against the Pure Storage array to make sure there's enough to provision the volumes. We then create our A network; we also assemble the zone in Nuage Networks. Then we create the virtual machines and place them on that network so they land on those particular hosts. We then run Ansible or Chef, depending on what the developers want to use; we have two templates for that.
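To make the aggregate-and-flavor tagging concrete, here is a minimal sketch of how those workflow steps might look with the upstream OpenStack Ansible modules (os_nova_host_aggregate and os_nova_flavor). The aggregate name, hypervisor group, and the "app" metadata key are assumptions for illustration; this also assumes Nova's AggregateInstanceExtraSpecsFilter is enabled in the scheduler and that OpenStack auth is configured via clouds.yaml or environment variables.

```yaml
# Sketch only -- names and the "app" key are illustrative.
- hosts: localhost
  tasks:
    - name: Assemble the host aggregate from the inventory file
      os_nova_host_aggregate:
        name: myapp-aggregate
        hosts: "{{ groups['myapp_hypervisors'] }}"
        metadata:
          app: myapp
        state: present

    - name: Create a unique flavor tagged to match the aggregate metadata
      os_nova_flavor:
        name: myapp-flavor
        vcpus: 4
        ram: 8192
        disk: 40
        extra_specs:
          "aggregate_instance_extra_specs:app": myapp
        state: present
```

With the scheduler filter enabled, any VM booted with myapp-flavor can only land on the hypervisors in myapp-aggregate, which is what pins each application to its own set of hosts.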
This installs the application on those particular hosts. When we launch the VMs, we tag those hosts with metadata specifying the profile of application to be installed; the Ansible playbook just reads that and uses the corresponding playbooks and roles to install the application. We then create a VIP on the NetScaler. And then comes the rolling update process. This is a customizable step for teams, so they can look after how they deal with state for their stateful applications: for instance, Kafka would not match something like Informix, but the profiles are quite similar in the common workflow actions. Generally, they create a small custom Ansible playbook that deals with state for their particular microservice application.

After the volumes have been created, we create a snapshot against them; this is all part of the rolling update stage. We then mount the Cinder volume to the first virtual machine, the 01 node, and bring that volume into service on the load balancer. We do the same for the second volume and bring that into service as well. And then we check whether all the tests have passed before promoting to the next stage. This goes all the way through: this is a prod pipeline, but it would be the same for your quality assurance environment, your integration and performance testing environments, and production.

This is where it gets interesting when you come back in and do the B deployment. Again, the same pipeline executes. We basically set up the prerequisites: if there has been a change in the hypervisors, it will rearrange them, so if you introduce a new host, the next deployment will go to that new hypervisor. This is good for disaster recovery as well, because if you wanted to migrate machines onto a new set of hypervisors, you could just introduce that in the inventory file, check it into source control, and push it through. We check capacity again to make sure we have enough to do the deployment; we want to fail fast, because you don't want teams provisioning against OpenStack if they don't have enough resources. Then we create the B network and launch the virtual machines into it. Again, we install the application onto those virtual machines. We go to create the VIP; all of our playbooks are idempotent, so no change is made there, it simply detects that the VIP is already assembled in the correct state.

Then we go to the rolling update process again. What we do is create a brand-new volume from the existing volume, then create a snapshot of it. We drain the connections on node 01, rename volume 01 to 01-old, mount the new volume, and bring it into service. Then we move on to the next volume: again we create a brand-new volume, create the snapshot, drain the connections, rename volume 02 to 02-old, mount the new volume, and bring it into service. That is how we transfer state across. This is an example we've used for Informix databases; obviously there are different ways you can do this. For some applications you would lose too much data doing it this way, so you could instead just persist the same volume.
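A minimal sketch of the clone-and-swap volume steps in that rolling update follows. Volume names are hypothetical; os_volume and os_volume_snapshot are upstream modules (the latter being the one mentioned earlier), and I'm assuming os_volume takes a source volume via its volume parameter, as the upstream docs describe. os_volume_rename is the custom module from the talk, so its parameter names here are guesses.

```yaml
# Sketch of the B-deployment volume swap -- names are illustrative.
- hosts: localhost
  tasks:
    - name: Create a brand-new volume as a copy of the existing one
      os_volume:
        display_name: myapp01-data-new
        volume: myapp01-data      # source volume to clone from (assumption, see above)
        size: 100
        state: present

    - name: Snapshot the new volume (used to spin off test-environment copies)
      os_volume_snapshot:
        display_name: myapp01-data-new-snap
        volume: myapp01-data-new
        state: present

    # <-- team-specific step: drain connections on node 01 here -->

    - name: Rename the old volume out of the way (custom os_volume_rename module)
      os_volume_rename:
        volume: myapp01-data
        name: myapp01-data-old

    - name: Rename the new volume into place
      os_volume_rename:
        volume: myapp01-data-new
        name: myapp01-data
```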
But the way the Informix application works, it can essentially check the cluster and then replicate the data across to it. And again, we test the cluster; if the tests fail, we roll back to the previous release. And then we clean up the previous deployment. Okay, so at this next stage we're going to show this in action in OpenStack, since animations can only go so far. So over to Billy. If we can switch to the other laptop, please. Yeah.

So here we'll show you a simplified demo with a single MySQL node and how we can migrate the state between the different deployments. As you can see, I'm trying to keep the MySQL sessions alive during this whole time. Let me walk you through what we currently have. Can you guys see this? Okay, let me zoom in a bit. Sorry about this; it's got a really long name. So, as before: we have our A deployment, which is the old one, and now we have our B deployment that is ready to go into service. At this point in time, we have installed MySQL on the new node, but we haven't attached the volume, and MySQL is not running on it.

If we flip back over here, you can see I have an SSH connection to instance A and instance B. On the first one, if I log into the database and select the demo database, we can see we have a single table with a couple of entries in it. If I try the same on the second one, we can see that MySQL is not even running on it, and there are no volumes attached to it. And in this simple example, because it's a single node, if we have a look at the CNAME, we can see that it's currently pointing at the A node. So if we try to connect to the database from a remote host via the CNAME, we'll get to the current A instance, which is here.

So what I'll do is walk you through the Ansible playbook step by step. I'll kick this off; I've introduced pauses in it, but we can see the implementation of the same playbook just above. Currently, we are at the stage where we create the Cinder volume. This is the volume that will be migrated across, and because it already exists, the task just reports no change. If we flip back to OpenStack and filter on the app name, we can see that we have our volume there, attached to node A.

Next step: because it's a single node, this is not a zero-downtime deployment, so we stop MariaDB on that instance. And if we try the remote connection again, it should no longer work. Yeah. The next step is to unmount the volume from node A and detach it. The Pure Storage driver is more optimized in the new version, so this is quicker. Or so they tell me. Now that the volume has been detached, we can see it's no longer available here; there is no /data.

The next step is to create the new volume as a copy of the existing one. If we kick this off, give it a bit of time, flip back to the volumes in OpenStack Horizon, and quickly filter, we can see that we have our new volume. It's a full copy. Yeah. And here we delete the old snapshot, because we no longer need it, and create a new snapshot. What we do with the snapshots is, if we need a copy for a particular test environment, we can snap off that and create copies; that's why we create the snapshots. Our next step is to attach the new volume to the new instance. Obviously this is all a lot quicker when you don't add the pauses. Yeah.
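For reference, the stop-unmount-detach portion of the demo can be expressed with stock Ansible modules roughly like this. The instance and volume names are hypothetical; service, mount, and os_server_volume are the upstream modules.

```yaml
# Sketch of taking node A out of service -- host and volume names are illustrative.
- hosts: instance-a
  tasks:
    - name: Stop MariaDB on the old instance
      service:
        name: mariadb
        state: stopped

    - name: Unmount the data volume
      mount:
        path: /data
        state: unmounted

    - name: Detach the Cinder volume (cloud modules run from the control node)
      os_server_volume:
        server: instance-a
        volume: myapp01-data
        state: absent
      delegate_to: localhost
```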
And if we run fdisk on the new node, just for good measure, we can see that we have our volume attached. The next step would be to check the file system on it and maybe format it, but because the volume already exists and has been formatted, Ansible will just skip over those steps; there's nothing we need to do there on our rolling deployments. Yeah. The way we've done this, we make sure you don't have two separate deployment pipelines, one for a day-one deployment and one for a rolling update: we incorporate the two and use the appropriate filters and when conditions.

Our next step is to mount the new volume, and if I do df -h, we can see that we have our volume mounted. The next step is where we rename the volumes, from data to data-old and from data-new to data. That's with the new rename module we had to write with shade. If we go back to Horizon: this was the volume we had before, and when I refresh, we should have data-old, which has already been detached, and data, attached to instance B.

The next step is to prepare MariaDB and start it on the new node. In here, we can see that MariaDB is running, and if we use the database from before, we can see that we still have our data. This is the point where we also flip the CNAME. You could use a load balancer for this as well, but for the demo we've just used DNS. So we can now also access the new node remotely via the CNAME record.

Now that the application is running on the new node, we're good to go and delete the old volume, so we just go ahead and do that. Obviously you'd run some tests in between before doing that, which we decided to skip in the interest of time for this demo. And that's the end of the playbook: we now have our database running on the new node, and the old node is ready to be cleaned up further down in the pipeline stages.

So that was our demo. I realize I didn't actually go through the playbook as I was doing the demo, so let's go through it now. We create our Cinder volumes using os_volume. We stop MariaDB using the service module on the old node. We unmount the volume from the old instance; this is all using out-of-the-box modules, apart from the two we created, as mentioned before. Then we detach using os_server_volume and use os_volume to create the new volume. We manage the snapshots as well, using os_volume_snapshot, and attach the new volume to the instance. Then, obviously, we do the file system checks, and formatting if need be for our day-one deployments. With this model, teams can substitute in their specific stateful commands in between, to synchronize the data, et cetera. Then we mount the new volume, again using upstream Ansible modules, and rename the volumes to old and new.

We also tag the instances. When we put an instance in service, we tag it as live, just so we know it's currently serving traffic and should not be cleaned down. When we originally spin up an instance, we set its status to in-progress. So if the pipeline fails for any reason, we can just kick off a new one, and it will automatically clean down any VMs that are in-progress or old, which means they're not serving traffic. We put them into a live status once they're serving traffic, and that's just updating OpenStack metadata. Yeah. And here we set the ownership on the data directory for MySQL; this is mostly for day-one provisioning.
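The status tagging just described can be done with the upstream os_server_metadata module. A minimal sketch, with a hypothetical server name and assuming "status" is the metadata key they used (the in-progress/live/old values follow the talk):

```yaml
# Sketch of the lifecycle tagging -- server name and "status" key are illustrative.
- hosts: localhost
  tasks:
    - name: Mark a freshly launched instance as in-progress
      os_server_metadata:
        server: myapp01-b
        meta:
          status: in-progress

    - name: Promote the instance to live once it is serving traffic
      os_server_metadata:
        server: myapp01-b
        meta:
          status: live
```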
We start the service, and this is where you put in your custom load-balancing solution; in this case, it's a simple CNAME using the Infoblox modules. And obviously you delete the old volumes. That's pretty much it on the implementation side for this fairly simple demo. Okay, can we flip back? Thank you.

So these are the benefits of using this kind of methodology with all of our applications. We do around a thousand code deployments a day across test and production environments. Every time a developer commits into GitLab, it essentially triggers a deployment pipeline for their application; that's why we have so much churn in terms of the virtual machines that are created, and in our OpenStack implementation we spin up around 3,000 virtual machines a day. What we do at Paddy Power Betfair is, each virtual machine lasts a maximum of 30 days, because we do all our patching at the start of the pipeline, promote those images, and then do immutable deployments for each of our virtual machines. We have checks in place that check the age of virtual machines, and if any team hasn't deployed within 30 days, we get them to redeploy for patching purposes.

We've also lowered the mean time to recover from failure. The pipelines you've seen are the same for every team, with just the customizable steps for the Ansible role or the Chef recipe that installs the application. The development teams own those roles or recipes, because they're in charge of the application and know it best. The customizable rolling update is per-team as well, and that's to manage the state: for different applications, you could have a rolling update that swaps boxes in and out of a load balancer, ones that flip over CNAMEs, or ones that create volumes, as you've seen here.

The traceability factor is very key for us. Each of those pipeline steps is hooked up to a Slack channel, meaning that if we see any issue with a particular stage at any point, such as one of the infrastructure self-service pieces like spinning up virtual machines, we get an alert on Slack where we can check it. We also have a repeatable deployment process: we don't allow teams to spin up virtual machines in a hundred different ways. We govern that centrally and then allow them the flexibility, with the templates, to substitute in the customizable steps they need.

The end state for us, and I think we're at a bit more than 644 hypervisors now, is 650 hypervisors per data center, which will take us over 100,000 cores on OpenStack and around two petabytes of storage. Okay. And that brings us to the end of the presentation. We have a white paper that documents our reference architecture on our engineering blog, for other users who want to go on the same journey and look at how we selected the tooling, how we selected the vendors, and how we put this implementation together. So you can check that out. I guess we've got time for questions if anyone wants to ask; otherwise, we're good. Okay, thanks.