So it's the last day of the summit. I hope everyone's had a really good time, and I hope people aren't too sleepy; we'll try and lighten things up as much as we can. We've been speaking at a number of conferences over the years: we've done sessions on how we designed our OpenStack cloud, how we built our OpenStack cloud, and how we moved our workloads so that we now have pretty much all of our workloads running on our OpenStack cloud. This is possibly the least glamorous of all the talks, but it is how we're maintaining it, running it, upgrading it, keeping the lights on. It feels a bit like the fourth part of a trilogy, to quote Douglas Adams. It's the day-to-day. There isn't going to be anything new, anything shattering, any new technologies, but it is going to be our experience of what you should and shouldn't do when you're running a large OpenStack cloud with over a hundred thousand cores and all your production workloads on it.

A little bit about us: Paddy Power Betfair formed in 2015 from the merger of two online gaming companies, both with similar business models in the sportsbook betting space. Betfair also has an exchange offering, which allows peer-to-peer betting with an open API. The API generates a large amount of traffic, and we don't throttle it at all; it's where our customers are, it's where our revenue comes from, so we try to encourage as much access as possible. This brings a certain amount of scale challenges and a certain amount of timing challenges, and there's a whole sequencing issue to be managed on the exchange as well. It isn't like Twitter: it's critically important that things happen in the right order. It is financial; I mean, it's not financial services, but it is still people's money, quite large sums of money at any given time, and it needs to be transactionally safe. We are in a heavily regulated industry, as you can imagine, so we have a number of constraints from gambling regulators, across the EU particularly, but also more and more overseas, with US states now getting involved. We have a number of other brands, just to go through quickly: FanDuel, TVG and Draft in the US; Timeform, who produce data to enable more informed betting; and Sportsbet, which is our Australian operation. The primary focus of our OpenStack cloud is the Paddy Power and Betfair brands; however, pretty much everything you see on Paddy Power and Betfair is running on OpenStack.

So, a little bit about us individually. I'm Adrian Miron. I was in DevOps in our company for four years; after that I decided to move to the infrastructure side, becoming a senior infrastructure engineering manager, and we help Thomas's teams keep the lights on and bring operational excellence to our infrastructure. And with that I'll hand over to Thomas.

Back to me. I was a developer; I joined Betfair in 2006. I've been a developer, I've run the software development teams on the exchange, I then spent a while moving the workloads onto our cloud, and my current role is head of cloud automation, which covers both automating our OpenStack cloud and all the CI/CD pipelines and tooling for the software development teams. Our teams are in four primary locations: in Ireland, in the UK, in Porto, and in Romania, in Cluj. So it's a distributed thing, with teams supporting the software engineers.
We have about five or six hundred software engineers continuously pushing deployments, so that's quite a big estate. Just a quick slide for those of you who aren't familiar with the product: if you interact with our websites, you are ultimately interacting with OpenStack at any given time. So this isn't a science project (although some of the science projects are very impressive); it really is what generates our revenue, and it has been a game changer for us as an organization. There are still a few things that we're migrating across, mostly the legacy backlog that you get within any organization; and if you're a merged organization you get even more legacy backlog, because you have two sets of things that no one wants to switch off because they're not sure whether anyone's using them.

At a high level, this is just some of the technologies that our software developers use on a daily basis, and as you can imagine, it is constantly changing. There are various relational databases and NoSQL stores, and increasingly we're seeing messaging architectures and streaming as a way of removing dependencies on single data sources. As a team we don't have, and nor should we have, any control over that; the software developers will use the right tool to meet the product requirements they're solving. Our job is to make it as seamless as possible for them to deploy onto production in a resilient and scalable manner. It's a bit scary at times, because the pace of change is very fast indeed.

At a high level, this is from our reference architecture that we published about four years ago; it's in the public domain. It's a high-level view of how we've built our cloud. One of the key things is that it is mirrored: there are two data centers, which gives us a sort of live disaster recovery scenario. At any given time we can move traffic between one data center and the other, and that's been key to the success of the product; from an enterprise perspective, it brings immediate value to the company financially. I'm not going to talk through this slide, but there are people you can ask if you're interested, you can look it up, and I will be uploading the slides as a PDF as soon as I get the instructions on how to.

Right. We first built our OpenStack cloud with a small number of third-party consultants who came in and helped us; we used a distribution, and Red Hat sent some consultants in to help us set it up. Over time we ended up building a small team of OpenStack-aware people, some of whom were graduates and some experts we brought in externally. You will see a number of presentations, historically, across the various OpenStack conferences; there's one, a trilogy actually, comparing the cost of an OpenStack cloud with an AWS cloud, by OVH and a couple of others, and it's worth a look, it's good value. But they all used a team of six people, 24/7, maintaining their OpenStack cloud, and this is how we originally did it. Well, six is a number.
We had six headcount; there were never actually six people in that team at any given time while we were building it. What we found was this: it's possible to build a cloud, but as soon as you have to maintain it 24/7, with a thousand hypervisors in two data centers and all the other complexity we showed on that chart, the small issues, the hardware issues that invariably happen, become an enormous drain on people's time. What we wanted from that team was for them to drive forward: to look at how we would upgrade, how we would improve things, what else we could do, what other projects we could make use of beyond the ones we started with. We wanted to continue exploring what OpenStack could bring, and while they were drowning under a sea of tickets that was never going to be possible. I've got a rough visualization of our PagerDuty calls over time, and you can see that at one stage we were answering the phone the entire time; of late it has been significantly better, and that's been down to the approach we took of trying to separate out the responsibilities.

Now, it hasn't been without its challenges, and I think Adrian will talk to you about some of them, but essentially what we did, in summary, was split into two teams, and we asked the guys who had traditionally been running the legacy infrastructure (I think they've been variously branded as IT Platforms, Infrastructure Engineering and various other names over time) to come on board and help us out, and Adrian is going to talk to you about the details of that. The quote from DevOps Borat is a nice one, that you end up with lots of teams of one; but the key thing about a team is that we work together, and across teams that can be tricky.

Yeah, so to make this a success we've implemented new ways of working in our infrastructure by doing sprints, and one of my guys, for example, is working really closely with Thomas's teams in order to make this happen. Another few guys will obviously join their sprints in order to bring all the knowledge and all the operational work into our area, to give them time to focus on continuing the project and to give us enough time and knowledge to carry on the operational work.

That being said, I'm going to present the single pane of glass for storage: what we have behind OpenStack and how we monitor the storage behind it. Again, this is a single pane of glass over multiple devices. Why did we choose this model? Because it's really easy to operate the devices, really easy to maintain them, and really easy to upgrade them. We also segregate environments, production from non-production zones. We also implemented synchronization between devices, so if you want to migrate from one device to another it's straightforward thanks to synchronization at the volume level, and they are mapped through Cinder to our VMs in OpenStack.
So this is our single pane of glass for storage.

Having described a single pane of glass for storage, one thing we don't have, and I would recommend that anyone running their own cloud does have, is an overview of what your consumption rate is, what your capacity is, what your usage is. That's something we haven't had, and still don't fully have. I'm going to talk very briefly about the solution we've put in place, but as I go through it I think you'll see that it's not optimal, and as we start looking at including other clouds in our tooling and the supply we offer, I think it's going to be increasingly important that we don't end up with a separate way of managing, operating and reporting on each of the clouds we use, which is the way the model is currently looking.

What we have here is a rough design of how we pull information about consumption out of our OpenStack cloud, in order to build a visible dashboard for people to monitor. This is essentially for our 24/7 operations teams, but also for those looking to make purchasing decisions about further hardware that we want to integrate into our cloud. Essentially it's based on an Ansible module written in Python. It uses the OpenStack SDK library, and it hits the undercloud and the overcloud to pull out information. It then ships that information, in a predefined JSON format, to Splunk, which is a log monitoring solution with reasonable dashboarding capabilities; if we get time, we'll touch on those in a second. This runs on a 24-hour cycle; it isn't live monitoring of your usage, it is simply what is allocated in terms of resources in your cloud, so that we have a little bit of understanding of what's going to happen next. Because with the CI/CD model we have, and the independence we give our software developers, they could at any time configure an extra hundred VMs for their application and try to deploy it, and the last thing you want is a broken deployment midway through because you don't have enough compute to offer, which has happened. (A rough sketch of this kind of collector appears below.)

That slide isn't meant to be legible; I can see people squinting. If we get time at the end I can open up the dashboards and go through exactly what they show from our two different clouds, or rather our two data centers, in terms of the visibility this gives us. But, to repeat, this is a stop-gap solution for us, and the fact that OpenStack gives you APIs that you can trawl and do what you want with helps us out of a hole. I would strongly recommend you go and look at the various sets of tooling that will give you visibility into what you're doing and where you're going; I think we approached it as an afterthought, and that was probably a mistake. There's also a blog post, which I might share afterwards, about how we built that lightweight reporting platform.

So I think this one is for you, Adrian, right? The infrastructure is across two data centers, and we've built everything with an immutable design, so when we do our deployments we rebuild everything from scratch. This hasn't been without its challenges, and nor has running a physical infrastructure.
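For illustration, a minimal sketch of that kind of allocation collector, written here as a standalone script rather than the actual Ansible module, with assumed cloud names, Splunk endpoint and token, might look like this:

```python
# Minimal sketch of a daily allocation collector (assumptions: cloud names in
# clouds.yaml, Splunk HEC endpoint and token; the production version is an
# Ansible module rather than a standalone script).
import requests
import openstack

SPLUNK_HEC_URL = "https://splunk.example.com:8088/services/collector/event"  # assumed
SPLUNK_TOKEN = "00000000-0000-0000-0000-000000000000"                         # assumed


def collect_allocation(cloud_name):
    """Return one record per hypervisor with allocated vs. available capacity."""
    conn = openstack.connect(cloud=cloud_name)  # reads clouds.yaml
    records = []
    for hv in conn.compute.hypervisors(details=True):
        records.append({
            "cloud": cloud_name,
            "hypervisor": hv.name,
            "vcpus": hv.vcpus,
            "vcpus_used": hv.vcpus_used,
            "memory_mb": hv.memory_size,
            "memory_mb_used": hv.memory_used,
            "state": hv.state,
        })
    return records


def ship_to_splunk(records):
    """Send each record to Splunk's HTTP Event Collector as a JSON event."""
    headers = {"Authorization": f"Splunk {SPLUNK_TOKEN}"}
    for record in records:
        requests.post(
            SPLUNK_HEC_URL,
            headers=headers,
            json={"event": record, "sourcetype": "openstack:capacity"},
            timeout=10,
        )


if __name__ == "__main__":
    # Run once a day (e.g. from cron): this reports allocation, not live usage.
    for cloud in ("dc1-overcloud", "dc2-overcloud"):  # assumed cloud names
        ship_to_splunk(collect_allocation(cloud))
```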
Yeah, we had a lot of issues, I need to be honest with you: MLAG issues, storage expansion, firmware patching. Thomas mentioned that we have active-active data centers, so we could fix these in multiple ways, by moving the traffic into a single data center and replacing all the hardware, or fixing all the issues we had, in the second one. But because we were just before Cheltenham and the Grand National this year, before the spring racing, we couldn't do that, so we handled several things differently; for example, for the battery replacement we decided to do it in batches. We have an immutable environment, as Thomas mentioned, and we keep the VMs on local storage on the hypervisors; in case of a failure we give the developers spare hypervisors, so they can deploy their TLAs, their microservices, onto them. We also had CPU spikes on our hypervisors; we raised cases with the guys from Red Hat, and they gave us a lot of hints and helped us a lot there. We also tackled the Meltdown and Spectre vulnerabilities by applying all the patches across our infrastructure, and as I said, we can do that by moving the traffic into a single data center, applying everything in the second one, moving the traffic back and then taking care of the other one. So we can do it in multiple ways.

I'm going to talk about the migration from Kilo to Newton, which is ongoing at the moment, together with the guys from Thomas's team. The process is extremely simple: we take the hardware from OSP 7, from Kilo, we review the hardware, and we put it into Newton, using the pipelines. Everything is done through the pipelines; there are six steps to cover, and the pipeline was written by the project team. How do we do that? First, we create the GoCD structure, by downloading all the requirements and preparing the migration. The second step is to create the virtual environments, because we have a lot of Python modules and Ansible playbooks and we need those environments. The third step is the clean-up stage, where we clean up the licenses and all the monitoring from Kilo. In the next step we load the profile: we have two types of profiles, application and database, and it depends which type of migration it is; if it's a database, obviously it will be a database profile. Then we take the next step, which is the Newton preparation, and the final stage is to have that specific hypervisor migrated into the new world, which is Newton.

Okay. In terms of operational excellence, the guys integrated everything with Slack and PagerDuty, and we use Sensu for monitoring our infrastructure, again with PagerDuty, as I mentioned. I can do a short demo just to show you exactly how it works. So what are we demoing here? Right, I'm going to simulate a hypervisor failure, so I'm going to reboot one of the hypervisors. Is this a production hypervisor? Yeah. We don't have a change for that, Thomas. Okay, so that hypervisor is down now.
So now we have to wait another minute and a half or so to receive a call from PagerDuty. Meanwhile, we can take any questions if you want, or... I think one of the things we skipped over earlier is that the GoCD pipeline you saw Adrian talking about is almost identical to the ones we use for deploying the applications the software developers build. There are additional, different steps, but the interface is the same, and that level of automation is by design: we want to interact with our infrastructure via Git and configuration changes, and then trigger the pipelines that apply them. The repeatability and the immutability are key for us. There should be no pets; it's cattle by design. And we went for a PagerDuty call; I never anticipated a PagerDuty call being wanted quite so much. So yeah, the server should go down in a bit, and I can do a hard stop anyway. We're waiting for PagerDuty to call me; let's give it a minute, and skip it if we need to rather than hold up the demo.

Okay, so we have received the alert in Slack, as you can see here, for that specific server, and I'm waiting for PagerDuty to call me in a bit. Did you hang up? Yeah, I acknowledged the call, and obviously now what I can do is check exactly what instances we have running on this specific hypervisor, to make the developers aware that the VMs are not running any more and to decide what we should do. So if we have a hypervisor failure and we need to replace the hypervisor, we just take one from the spares and give them the spare (they are allowed to deploy their microservices into production at any time), and we can take the hypervisor that failed, investigate it, replace the failed parts and engage the guys... hi, Andre... we can engage the guys from the data center to replace those failed parts.

So, Adrian, if I'm a software developer and my VM has suddenly disappeared because of this hardware failure, have I also received a PagerDuty call? Yeah, you receive the PagerDuty call, you're notified, and you should also be on our Slack channel together with us; we will continue the investigation, and we will decide together whether we need to give you a spare one or not. Okay, in this instance we can see Andre has just joined the Slack channel, and he needs his hypervisor back. What would be the process for getting Andre his hypervisor back? So there's a pipeline in place: once we take this one out of the production environment, we give him a spare, we tell him exactly which is the new hypervisor, and he will be able to deploy his TLAs onto that hypervisor. And all of this would be via the GoCD pipelines that you showed us earlier? Yeah, everything will be done through the CI/CD that we have in place. For the demo, are you going to do any of that? Yeah, I'm going to log into our OpenStack and check exactly what instances we have on that specific hypervisor. You will see that all the instances on this specific hypervisor are shut off; as I mentioned, the instances are hosted locally on the hypervisor, and by giving the developers a spare hypervisor they will be able to destroy these VMs and create new ones on the new hypervisor.
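The check Adrian runs by hand there, listing what was sitting on the failed hypervisor, can also be scripted against the same APIs. A minimal sketch, assuming the openstacksdk client and placeholder cloud and hypervisor names (in their setup this sort of thing sits behind the GoCD pipelines), might look like this:

```python
# Minimal sketch: list the instances on one hypervisor so the affected teams
# can be told their VMs are down. Cloud name and hostname below are assumptions.
import openstack


def instances_on_hypervisor(cloud_name, hypervisor_hostname):
    conn = openstack.connect(cloud=cloud_name)  # reads clouds.yaml
    # 'host' is an admin-only filter that limits results to one compute node;
    # all_projects lists servers across every tenant.
    servers = conn.compute.servers(all_projects=True, host=hypervisor_hostname)
    return [(s.name, s.status, s.project_id) for s in servers]


if __name__ == "__main__":
    for name, status, project in instances_on_hypervisor("dc1-overcloud", "compute-042"):
        print(f"{name:40} {status:10} project={project}")
```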
Thank you. We're a little bit ahead of time, but I think there's a Q&A slide we can put up, in the hope of prompting some questions. Any questions about how we're doing the day-to-day management of our cloud?

So, the question is how do we do the upgrade from Kilo to Newton, or OSP 7 to OSP 10. We've leveraged the immutable model that I described, and I think we touched on it with the pipeline that you saw. Effectively we've built each tenant from scratch in OSP 10, and then we've asked the software development teams to change one entry in their YAML file, which is the name of the cloud, so it goes from being OSP 7 to OSP 10, and then they redeploy. We have a 30-day redeploy policy: in order to avoid having to patch operating systems in place, we ask them to redeploy every 30 days, which means they pick up a patched image every 30 days. We've actually had to accelerate that somewhat, because as you can see it's quite a wasteful use of hardware to have everything running in duplicate at that level, so we went out to them and said, could you please redeploy all your applications in the next week. It's been extremely seamless; the applications have spun up in the second cloud. It does mean we were effectively running two clouds until we managed to get everyone through that process, but it has been a redeploy from scratch, and it means our config is entirely accurate. I think OVH, at one of the keynotes, were saying that they'd spent a year doing R&D and then four hours doing their upgrade; ours has been similar. There's been about a year of design and testing, and then it hasn't been four hours, because it isn't really us doing the upgrade, it's the teams, and we have this Tetris of bringing hypervisors in and out of one cloud. But it has been reasonably seamless: a few permission differences, and a few extra bits we did like changing the LDAP groups, have caused people some issues, but the actual Kilo to Newton upgrade has been surprisingly issue-free. Anybody else?

The question was: do the developers have control of the compute nodes on which their applications are running, because we described it as them redeploying onto the spare node. Yes and no; I think I slightly led you astray by saying they deploy onto the spare node. They would redeploy their application, and if it's a pooled tenant then, if Nova's scheduling behaves as expected, some of them would end up on the spare node. Some are pinned: some applications, to avoid CPU contention and noisy neighbors and so on, are pinned onto specific hypervisors, and in that case we would tell them what the new one is, they would change their YAML to pin to the new one and then redeploy. Often those are the heavier, more stateful applications. Anybody else?

I'm wondering, on the single pane of glass view for your monitoring, how do you determine what the things are that you want to monitor? Because we're in a similar situation, and we wonder what the most important metrics are that you're watching and monitoring, and how you set up that single pane. Yeah, so how did we define what's in the single pane of glass view? Time and experience: people asking us questions that we couldn't answer, effectively, has been the driver. People have come to us and said, I'm running out of compute in my tenant, can I have more hypervisors?
We've also, I think, shown the Splunk-based dashboards, but we also have a TSDB instance recording actual usage, at both the VM and the hypervisor level. We've been trying to use that to, first of all, look at how allocated the tenants are (we get that from OpenStack into Splunk), but then we go back and say: well, actually, you claim to be running a three-to-one contention ratio, but you're only ever using 10% of the CPU on the hypervisor; can we look at how you've configured your application? Minimizing cost is, in the enterprise world, a key thing. There is the same issue in storage as well: we found that we had given the developers free rein and they had overcommitted the storage to quite a terrifying ratio, and we had to go back and work out what they were actually using. I've done this myself as a software developer: if someone gives you free rein over how you specify your infrastructure, you are going to bulletproof it, because you don't want to be phoned on a Saturday and you don't want your customers to be impacted. So there is a balance there, but we didn't have a methodology; we've rather done it by trial and error.

Yeah, from the storage perspective, to protect ourselves we added QoS, and also, to protect ourselves on the capacity side, we decided to split the volumes across multiple devices, and we have the view of exactly what's running on each device, as I showed you.

Anybody else? In which case, thank you very much for your time.
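As a footnote to that last answer, the allocation-versus-capacity side of the check can be approximated from the Compute API alone; a minimal sketch with an assumed cloud name (the actual-usage side would come from their TSDB, not from this call) might be:

```python
# Minimal sketch: compare allocated vCPUs against physical capacity per
# hypervisor to spot over- or under-commitment. Cloud name is an assumption;
# real usage figures would come from the TSDB, not this API.
import openstack


def vcpu_commit_report(cloud_name):
    conn = openstack.connect(cloud=cloud_name)
    for hv in conn.compute.hypervisors(details=True):
        if not hv.vcpus:
            continue  # skip hosts reporting no capacity
        ratio = hv.vcpus_used / hv.vcpus  # allocated vCPUs per physical vCPU
        print(f"{hv.name:30} allocated={hv.vcpus_used:4}/{hv.vcpus:4} "
              f"commit_ratio={ratio:.2f}")


if __name__ == "__main__":
    vcpu_commit_report("dc1-overcloud")  # assumed cloud name
```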