Hello guys. Hi. Good morning. So my name is Vilenath Majof and I'm Jan Vigara. Today we'll be talking about immutable OpenStack infrastructure and how we approach it at Paddy Power Betfair.

First off, just a bit of background about us and what makes us different. Paddy Power and Betfair merged in 2016 to form a FTSE 100 company, and we have offices around the globe: in the UK, Romania, Portugal, Ireland, Malta, Gibraltar, the USA and Australia. If you're curious about the stuff we're doing, we also have an engineering blog you can have a look at, betsandbits.com. In our company we have over 1,000 engineers. Some of our products: we run a betting exchange, we have a sportsbook, and we also offer games and retail.

Just to give you some numbers on what we do daily: we have 135 million daily transactions going through our systems, and we do 30 billion daily API calls as well. All of this, as you can imagine, generates quite a lot of logs, so we produce 2.5 terabytes of logs daily, and we have 120,000 monitoring points generated every second from all those servers. Our transaction times are within four milliseconds. We are also building a hundred-thousand-core OpenStack cloud, and we have two petabytes of storage.

So this is our OpenStack journey. We had a four-week pilot back in 2015 (sorry, a four-week proof of concept), and then a six-month pilot live in our active-active data centres, where we onboarded two customer-facing applications. This also allowed us to test our self-service workflows, which we are going to see in a bit. We closed the pilot after those first two applications, and so far we've onboarded over 100 applications. We also upgraded to Nuage 3.2 R10, and we built our test lab. What we're currently working on is our Newton OpenStack upgrade.

All right. So how many of you have already upgraded their OpenStack infrastructure? Yeah? How easy was it? So, we're using Red Hat for OpenStack, and if we want to upgrade from OSP 7 to OSP 10 we have a couple of options. The first one is to follow the documentation: the documentation says upgrade the director with a yum update, then upgrade the overcloud images, upload them to Glance, and then you can start upgrading your overclouds. But the first step is that you need to modify your Heat templates, and this is where things start going wrong, because we're using a spine-and-leaf architecture, which means we have very specific and very customised Heat templates, which means that upgrading would mean a lot of testing and a lot of refactoring. Things have evolved quite a lot between the two versions, and that's probably because we are trying to move from OSP 7 to OSP 10: there's a huge gap between the two. The other solution would be to deploy a brand new OpenStack and migrate all the VMs from the old one to the new one, which is the solution we chose, and we're going to explain to you how we are achieving that.
So first, here is the reference architecture we are running on. Sitting at the top we have a global load balancer diverting traffic to our two data centres. After that we have high-performance Citrix NetScaler load balancers, then the Nuage gateways directing traffic into our clouds, and then our Red Hat OpenStack implementation sitting on KVM hypervisors. We also have Arista top-of-rack switches, and some plain libvirt hypervisors for our common infrastructure like LDAP, DNS and NTP. This is replicated across the two data centres, and we use NetApp and Pure Storage for storage.

This is some of the tooling we're using for our continuous delivery toolchain. We use ThoughtWorks Go for scheduling and visualisation of the pipelines, and we use Jenkins, like most people. We use GitLab for our version control, Chef and Ansible for configuration management and orchestration, Artifactory for our artifacts and RPMs, and Qualys for security scanning.

OK, so if you want to do immutable infrastructure, and in the case of OpenStack, you probably need to start right at the bottom: how you provision your infrastructure, how you do all of this. So let's start with how we decided to provision the network, and just a quick presentation of how we use spine and leaf. Is anybody familiar with the spine-and-leaf architecture? OK, a couple of you, perfect, I won't have to explain too much. Basically the idea is you split the traffic: you have the spines, and then you have the leaves connected to the spines, and each of them is in a different BGP routing domain, which simplifies things and prevents all the usual trunked traffic you can have between racks.

From this architecture we have this rack diagram. What we have is two leaf switches in each rack, one management switch where we connect the iLOs, and one management switch for the provisioning of the servers. So we started working on an Ansible playbook that provisions the switches, makes the connections between all the spines and leaves, and reconfigures BGP, reconfigures everything. We started by defining the inventory file, which represents what we have in the rack. If you have a look, based on this inventory file we know where each switch is located, and from the location of the switch we are able to make the right connections between the two and reconfigure the whole network each time.

The process we use is pretty easy. When we provision a new rack, we spin up the switches, they PXE boot and pick up their configuration over TFTP, we push a minimal configuration and do some firmware upgrades, and the switch is ready. Then we run an Ansible playbook that reconfigures everything. As an example, in this case we have a template that reconfigures BGP between the spines and the leaves and makes sure the connectivity is in place.

For our SDN we use Nuage Networks, and I'm going to cover how we consume Nuage for our SDN. We wanted to make the process for the developers as easy as possible, and also self-service, so we designed a set of YAML config files which they need to fill out, and they consume their network that way. Those files look like this. There is a subnet YAML which specifies the domain name the application should be deployed to, and also the maximum number of instances they might need for the application. Developers don't really care about subnet sizes; they care about how many VMs they want to have, and this just allows us to make it very easy for them: "I need X amount of VMs", they put in a number, and they don't care about the rest, which makes it nice and simple. We also have another file with the security ACLs. In this example we have an ingress ACL; by the way, this is from the point of view of the VM, because we try to make the self-service workflow very developer-focused and hide the network complexity from them while keeping it easy to reason about, so ingress means traffic flowing into the VM, not into the network. Here we have a fairly simple ACL for incoming traffic on port 8080, and next to it an egress ACL to connect to external databases such as MySQL.
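As a rough illustration, here is a minimal sketch of what those two self-service files could look like. The field names and values are made up for the example, not the actual schema we use.

    # subnet.yaml -- illustrative sketch only
    # Developers say where the app lives and how many VMs they might need;
    # the actual subnet size (e.g. a /26) is derived from max_instances.
    subnet:
      domain: ourapp.qa.example.internal
      max_instances: 40

    # acls.yaml -- illustrative sketch only, written from the VM's point of view
    acls:
      ingress:
        - description: application traffic into the VM
          protocol: tcp
          port: 8080
      egress:
        - description: external MySQL database
          protocol: tcp
          port: 3306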
So how does this map onto the Nuage object model? Using the same configuration we saw before, we end up with a zone for every application (the zone is the red one). Under each zone we have the subnet we've allocated for this application, in this case a /26, and on the right side of that we can see the virtual machine instances. Some have one, some have two; this is a QA environment, so the scale is fairly small. The way we do our ACLs is that the security policies are attached at the zone level, so all the VMs under the zone share the ACLs, and you can see how they look in Nuage.

OK, so now we have the network set up; the next step is to provision the physical infrastructure. How do we do this? Again, if you have a look at our rack design, as I was mentioning, each rack has its own set of networks and IP addresses. In our case we are just going to provision the servers, and at the top you have the IP addresses we're using for the iLO of each server. We translate this into the same inventory file: we add the servers we want to provision, and we add a couple of pieces of information like the iLO address we're going to use and the unit number where the server is located.

Based on that, the process we go through is: we reconfigure the network; again, based on conventions and templates, we are able to create the configuration, push it to the switch, configure the port channel, and provision all the network information we need. The next step is to create the DNS entry, change some iLO parameters, power on the server, add the server to HP OneView to manage it, and apply the configuration we want on the server. Then the server is ready.

Here is an example of how we configure the port channel. Based on the inventory file (the one shown here is really simplified; there is more data than that), the rack where the server is located, and the switch it is connected to, we extract the information we need, and we are able to push the configuration to the correct switches, create the port channel, and push the VLANs we want to use.
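To give a feel for that step, here is a minimal Ansible sketch of pushing a port-channel and its VLANs to a top-of-rack switch. It assumes Arista EOS and the stock eos_config module; the variable names (server_unit, rack_id, server_vlans) and the numbering convention are made up for the example, not our actual playbook.

    # Illustrative sketch: trunk the server VLANs on a new port-channel.
    - name: Configure port-channel for a newly provisioned server
      hosts: leaf_switches
      gather_facts: false
      connection: network_cli
      tasks:
        - name: Create the port-channel and allow the server VLANs
          eos_config:
            parents:
              - "interface Port-Channel{{ server_unit }}"
            lines:
              - "description rack {{ rack_id }} unit {{ server_unit }}"
              - "switchport mode trunk"
              - "switchport trunk allowed vlan {{ server_vlans | join(',') }}"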
Then here is a sample of our playbook that provisions the server itself. On the left-hand side you have some of the things we wrote specifically for iLO: changing the iLO parameters and resetting the host, then creating the DNS records in Infoblox, and after that powering off the server and adding it to HP OneView. After a couple of minutes of the playbook running, or after a couple of hours once the firmware has been updated on the servers, we can see the server finally appear in HP OneView.

And now that we have the hypervisors, let's see how we actually use them. We have created a self-service workflow for VM creation, and the way the developers consume it is by filling out a fairly simple Ansible inventory file. How many of you are familiar with Ansible? OK, quite a few, good, so this should look fairly familiar. First of all we have a naming standard for our instances, which is the data centre, the application, the host number, and then the environment it is in. We also define the number of vCPUs, the RAM, the disk and the image we want to use for each instance. In the placement prefix and the host we define which hypervisors we want those instances to land on; the reason we allocate hypervisors rather than have a shared pool is that we wanted to avoid the noisy-neighbour problem. And we also specify which application we want to deploy on each instance; in this case it's our app.

So how do we actually deploy all of this? We have a ThoughtWorks Go pipeline, and the first stage of it just pulls all of those config files and Ansible playbooks down onto the Go agent. We then set up the prerequisites in OpenStack; this includes creating the flavour and the host aggregates. We then check the capacity on the hypervisors, just to make sure we have enough capacity for the deployment we're about to do, and if we don't have enough we obviously break the pipeline. We then create the layer 3 network: this creates the Nuage subnet, the corresponding entity in OpenStack, and the ACLs. We do A/B subnets, so this will be the A subnet in this case, because we have nothing in our cloud yet. The next step is launching the VMs; this consumes the static inventory file we saw earlier. The next stage in the pipeline is "run Ansible": this applies our Ansible role on top of those VMs and gets the application ready to be put into production. The next step is "create web": this configures our NetScalers and the services on them. Rolling update is where it gets interesting: this prepares the application, migrates any state if needed, and then puts it in service. At that point we also test the application, just to make sure it's good, and we promote it in Jenkins, ready to be pushed onto the next environment. And then we obviously clean up the previous version, which we don't have here because it's a fresh deployment. We use the same process from QA to integration to perf and then to production; exactly the same process, just at a different scale in each environment.
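Going back to the static inventory the pipeline consumes, a sketch of one instance definition could look like the following. The host name follows the naming standard described above, but the exact keys (vcpus, placement_host and so on) are illustrative, not our real schema.

    # Illustrative sketch of a self-service instance definition.
    instances:
      dc1-ourapp-001-qa:            # <data centre>-<application>-<number>-<environment>
        vcpus: 4
        ram_mb: 8192
        disk_gb: 40
        image: rhel-7-base          # hypothetical image name
        placement_prefix: ourapp    # used for the host aggregate
        placement_host: dc1-hv-017  # pinned hypervisor, avoids noisy neighbours
        application: ourapp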
So we thought it could be a good idea to use the same pipeline and find a way to deploy the OpenStack director itself, and that's what we started doing. Instead of using OpenStack to deploy an application, we deploy OSPD directly on a bare-metal server, but it goes through exactly the same process.

A couple of things for when you want to provision an OpenStack infrastructure: you need a couple of networks and VLANs to be created. The first one, at the top, is the VLAN for provisioning; this is where the servers PXE boot, connect to the Ironic service sitting in OSPD, get their configuration and get deployed. Then you have two other networks, one for the internal API and one for the external API, where we can query everything. We then have two different networks on each hypervisor, one connected to the data network and the other connected to the Nuage network where we provision the subnets. At the top we also have the VLANs for the iLO, and if you have a look, each rack again has its own set of IP addresses for the iLO.

All those IP addresses and all those networks, we tried to convert them into data, so we wrote a playbook that deploys OSPD, and we converted all the information you could see on the previous slide into actual data. Here is the YAML file we're using, which we feed into our Ansible playbook. In this data we have the VLANs we want to provision for the data, storage and internal API networks, the interfaces they sit on, and all of that information, plus, for each rack, the network we are going to use; each rack, again, in the spine-and-leaf architecture has its own set of IPs to prevent problems.

From that information we create some templates which first provision the undercloud configuration, that is, the configuration used by the director to create everything else. We try to use convention as much as possible; as part of the convention, if you have a look, we use the CIDR IP range and from that we just extrapolate and extract the information we need: we take the first available IP address, and we derive the allocation ranges from that. So that's the first step, creating and provisioning the director.

Then we have our very specific custom TripleO templates. These are generated from the racks we define here, so we can specify how many racks we want to deploy OpenStack on top of, and from that we extrapolate and generate the rest; if you have a look at the top and at the bottom, we generate each rack and the configuration for each rack. And then we have the last step, which creates the custom file listing all the custom racks and how many servers we put in each rack. Once we have everything, we put this through the pipeline; everything green goes through exactly the same pipeline, and we run the same thing through the different environments, so we are able to test and make sure that before reaching the production environment we have tested everything.
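To make the convention a bit more concrete, here is a hedged sketch of the kind of network data we feed to that playbook. The keys and the addresses are illustrative, but the idea is the same: give each network a VLAN and a CIDR, and derive the director address and the allocation ranges from the CIDR by convention.

    # Illustrative sketch of the network definition fed to the OSPD playbook.
    # By convention, the first usable IP of each CIDR is taken by the director
    # and the allocation ranges are carved out of the remainder.
    networks:
      provisioning: { vlan: 100, cidr: 10.10.0.0/24, interface: eno1 }   # PXE / Ironic
      internal_api: { vlan: 200, cidr: 10.10.1.0/24, interface: eno2 }
      external_api: { vlan: 300, cidr: 10.10.2.0/24, interface: eno2 }
      storage:      { vlan: 400, cidr: 10.10.3.0/24, interface: eno3 }
    racks:
      rack01: { data_cidr: 10.20.1.0/24, ilo_cidr: 10.30.1.0/24, servers: 16 }
      rack02: { data_cidr: 10.20.2.0/24, ilo_cidr: 10.30.2.0/24, servers: 16 }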
OK, so at the end of this process we should have an OpenStack director ready, and we are able to start scaling out. For us, scaling out means adding new servers. Again, we have automated the process using Ansible, and we're still using the same inventory file where we describe the entire infrastructure, and this is where we start adding a bit more intelligence into the inventory file.

If you have a look on the left of our inventory file, we have the cloud we are going to use, and we use that inside our Ansible playbook to match what you have on the right-hand side, which is the cloud configuration. This is the os-client-config, which is then used by Shade and by Ansible to target the cloud we are going to put things into.

Then the process is pretty simple. We reconfigure the switches to push the VLANs; that's exactly the same step we went through at the beginning. We check the node against Ironic to make sure it is not already there, then gather some iLO facts (memory, CPUs, this kind of information), then we add the node to Ironic, the Ironic inside OSPD, set the node into maintenance mode, introspect the node to get some information from the node itself, exit maintenance mode, and update the template to say we now have X amount of servers.

This is the playbook we were using. Again, we gather facts from the iLO, and we try to figure out whether the server is a new one or an existing one, and based on that we add the node to Ironic. We modified the os_ironic module a tiny bit to add a couple more pieces of information: we push the rack and the unit and some other information like the profile, this type of thing, because later on we're going to use this, and I will explain how. And in the result, inside your OSPD, if you do an ironic node-list, this is the result from Ironic, and you can see that some of the information we passed in the inventory file has been pushed directly onto the Ironic node.

OK, so that's the first part: we know how to enrol servers, and we know how to do scale-outs; doing scale-outs is pretty easy using Heat and scaling out the infrastructure. The next step is: how do we create a second instance? That's pretty easy too. If you remember our network, we just duplicate it: copy and paste, change a couple of IP addresses and a couple of VLANs. So instead of the VLANs we had previously, which were 100, 200, 400, we change them to 101, 201, 401, and we also change a couple of IP addresses to completely isolate everything. Same thing for the racks: we want to isolate the racks and make sure they are not on the same one. Then we provision the new OSPD. At that point we will have two OpenStack directors running at the same time, one for OSP 7 and the other one for OSP 10.

Then, how do we move servers from one OpenStack director to the other, from one instance to the other? If you have a look at the diagram, we decided to go down the path of dedicating a rack to a cloud: at the beginning we have all the racks in the same cloud, and over time we provision new racks and migrate the VMs to the new racks, and we'll explain how we do this. In our example, this is, if you remember, the inventory file: we have all the clouds on the left, which here is the lab OSP 7 cloud, and if we want to create more entries, or specify which node or which rack is going to be in which cloud, this is how we do it. We specify the clouds we are going to use, this matches the cloud configuration again, and we target the two different OSP directors. The process is exactly the same: we enrol the server in the new cloud. And how do we migrate? Before enrolling them, we need to delete the servers from the old cloud; this is still work in progress, we're still working on that.
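As a sketch, the per-cloud assignment in the inventory could look something like this; the cloud names match the os-client-config entries, and the layout is illustrative rather than our exact format.

    # Illustrative sketch: which racks belong to which director.
    clouds:
      lab-osp7:            # existing OSP 7 cloud
        racks: [rack01, rack02, rack03]
      lab-osp10:           # new OSP 10 cloud being scaled out
        racks: [rack04]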
The idea is that, as you remember, on the Ironic nodes we know which rack each node is in, so we are able to do a dynamic query, a dynamic inventory, to gather all the information, filter it by rack, destroy all the machines inside that rack, add the nodes to the new cloud, and then scale out the new one.

So this is the reference architecture we have at the end: we have two different OpenStacks running at the same time. And now, here is the trick. Now that we have our new cloud ready and waiting, but somewhat empty, how do we actually consume it? To do that we wrote a custom Ansible dynamic inventory for OpenStack which allows us to query multiple clouds. Again using os-client-config, we have our clouds in a fairly simple YAML file; in this case we have our OSP 7 and OSP 10 named clouds. The same config also allows us to have many other clouds, for example if you want to add the other data centre, so they can all live within the same config file.

So how do we actually specify which one is the old cloud and which one is the new cloud? It's fairly simple: we just pass environment variables to specify that the old cloud is OSP 7 and the new cloud is OSP 10. We run the normal dynamic inventory to gather a list of hosts, and then we combine them into a single unified inventory list, but each instance is also tagged with the cloud it lives in, in case we need to use that later. And this is the playbook that actually allows us to reason over the clouds we have: we can name the clouds whatever we want, but we wanted to use the same playbooks for all the clouds going forward, and we didn't want to hard-code the names of those clouds, so we rewrite the name of the cloud dynamically to "old cloud" and "new cloud", allowing us to easily reason over them.

This is an example of how we would create an instance in our new cloud. I won't go over the Ansible, but it's fairly simple. And to delete from the old cloud, which we no longer need, we would do pretty much the same thing, but with state absent, targeting the old cloud through the old-cloud parameter for os_server.
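Putting those pieces together, a minimal sketch of the clouds file and the create/delete tasks might look like the following. The URLs, credentials and variable names are placeholders; old_cloud and new_cloud would be resolved from the environment variables mentioned above, and os_server is the stock Ansible OpenStack module.

    # clouds.yaml (os-client-config) -- illustrative values only
    clouds:
      osp7:
        auth:
          auth_url: https://osp7.example.internal:5000/v2.0
          username: svc-deploy
          password: "CHANGE_ME"       # kept out of the file in practice
          project_name: ourapp
      osp10:
        auth:
          auth_url: https://osp10.example.internal:5000/v3
          username: svc-deploy
          password: "CHANGE_ME"
          project_name: ourapp

    # Illustrative tasks: create the instance in the new cloud, remove it from the old.
    - name: Launch instance in the new cloud
      os_server:
        cloud: "{{ new_cloud }}"      # e.g. osp10, resolved from an environment variable
        name: "{{ instance_name }}"
        image: "{{ image }}"
        flavor: "{{ flavor }}"
        network: "{{ app_subnet }}"
        state: present

    - name: Remove the old instance from the old cloud
      os_server:
        cloud: "{{ old_cloud }}"      # e.g. osp7
        name: "{{ instance_name }}"
        state: absent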
So that's the theory so far; how do we actually do it? If we go back: now that we have our OSP 10 cloud adjacent to our OSP 7 cloud, again we go through the set of prerequisites. This pulls down the config, we create our flavours and host aggregates in the new cloud, and we check capacity, which should be empty, so it should be fine, right? We create our B network as an entity under the new cloud, but also linked into Nuage, which is a shared resource for both of them. We launch our VMs into the new cloud, and again we apply our Ansible role on top. We create the VIP; well, the VIP already exists, so Ansible will say "OK" and move on. And then the rolling update: we put our new application instances in service and take the old ones out of the load balancer, and then we make sure everything is OK. And then we clean down the old instances from the old cloud. When we have done this for all of our applications, we know that this hypervisor is ready to be recycled and put into the new cloud, and yep, we reclaim the hypervisor.

OK, so we've done our upgrade and we can go and grab a coffee. As everything is automated, that's pretty easy to do. But how do we make sure that the VMs are actually going to be migrated from one cloud to another? We have a strict policy at Paddy Power Betfair which says a VM cannot last longer than 30 days, for security reasons. So the idea is that developers will have to redeploy their application in the new environment, but they don't know whether it's the new one or the old one; they don't see that. We just allocate them a new hypervisor, they run the pipeline, and it's deployed into the new environment. So theoretically, within the next 30 to 60 days we should be able to migrate everything seamlessly, without any downtime, and we have fixed our problem.

Do you have any questions for us? We have a bit of time. Yeah, I think so; there's a mic near your place.

Question: What size are your clouds? I mean, how many nodes is the director able to handle at the moment? Answer: We have around 650 nodes across the two data centres, so 650 hypervisors, and we are provisioning new nodes on a regular basis. So it's 650 divided by two, 325 per data centre at the moment, but our end state is one thousand three hundred, so we're going to have six hundred and fifty for each data centre.

Question: You didn't say anything about storage provisioning; could you tell us what kind of storage you use and how you automate it? Answer: We're using Pure Storage and NetApp, with Cinder for that, and we have a presentation later today on how we do this exactly. But the process is exactly the same: we create the Cinder volume and attach it to the VM, and once we provision the new one, we re-import the volume into the new cloud and reattach the volume to the new VM. That's really the short explanation. Because the storage arrays are a shared resource between the clouds, that allows us to do this, the same as with the load balancers and the SDN.

Question: Organisationally, what was your approach, top-down or bottom-up, application to infrastructure or infrastructure to application, to encourage this CI/CD mentality? Answer: To start with, from an application perspective, it was top-down.

Any other questions? OK. Thank you very much for your time.