Good to go? Okay, welcome everybody, welcome this morning. We're going to present how we decided to architect the control plane of our OpenStack installations. We took a slightly different approach than what most of the bigger reference architectures show; nevertheless, it's quite similar to a few of the other control planes that you might already have seen presented at this summit.

My name is Marcel Harry. I work for Swisscom; that's another picture of me, but it's familiar to me. I'm leading the architecture of Swisscom's elastic stack, which is our stack based on OpenStack, and of our platform-as-a-service solution, which is based on Cloud Foundry. I'm a member of Cloud Foundry's technical advisory board, and Swisscom is also part of the Cloud Foundry Foundation. I have a background in automating all the things, with various automation tools, and in Linux system engineering.

And I am Alberto Garcia. I work for Red Hat as a cloud architect. I joined Red Hat about a year and a half ago, but I've been working with emerging technologies for the last five years, and in a previous life I was a network engineer, so I have a good background on the network stuff as well.

Yeah, and Alberto was on site with us, and that's how it all started.

If we look a little bit at why we had to build our OpenStack, and reason a little about the motivation behind our decisions, let's first have a look at the use cases. One of the very early drivers was deploying Cloud Foundry on top of OpenStack. Cloud Foundry is a very distributed system, really built to run in a cloud-native fashion, and it provides a platform to run cloud-native applications. There is one offering that we have on developer.swisscom.com, our public platform-as-a-service offering; everybody can go there, sign up and start pushing applications. However, we're also using this platform to develop new services on top. One of the services that then started being built, and which was also a huge driver in the requirements behind the platform we built, is the product myCloud. It's a cloud storage solution provided to all our residential customers as part of their plans, and it provides you with, official marketing term, infinite storage.

If we look a little at how Swisscom looks at IT and how we would like to produce it: we believe very strongly in a full CI/CD approach, meaning that we automate a lot of the things, or let's say all the things. We also have multiple stages where we then test the new releases that come out, whether they are new software releases or new automation releases. This gives us rapid release cycles, so we can iterate quickly on new features and on fixing bugs, but it also gives us real confidence in what we're pushing, because things are tested up front.

One more thing is where we see Cloud Foundry and OpenStack within our IT environment: we really see it as the platform to build the next generation of workloads at Swisscom. It's the platform for cloud-native applications, highly automated, with microservice-like architectures. With that in mind, that's also how we looked at deploying OpenStack. So, what do we usually see in OpenStack deployments?
We see a monolithic controller cluster, with the whole control plane running on top of a Pacemaker cluster on a fixed number of bare-metal nodes. But we were aiming for something different, something closer to what we see in our cloud workloads: something agile, something dynamic, a platform that can be plugged into the CI/CD pipeline, something that we can operate with modern operations methodologies. But we asked ourselves, since we don't usually see this in production environments: is it possible? Is it doable? Is OpenStack prepared for that? Is there a model prepared for that? So let's go through the analysis we did to figure it out.

First, the OpenStack control plane: OpenStack is a fully distributed system. It keeps services as decoupled as possible. If you think about how OpenStack services integrate with each other, they either use RPC, which goes through the message bus, or they use the database. These are well-known mechanisms for decoupling modern applications.

It also allows dynamic topologies. It does not matter whether you have one API or a hundred APIs: the communication goes through load balancers, so you're pointing at a VIP and it does not matter. And the coordination of the non-API components is done through RPC, through RPC calls and RPC casts, so you can have multiple components or a single one; it does not matter.

Also, control plane services can be virtualized. If you think about it, they don't demand huge resources: you can deploy, say, an API with just one virtual CPU and two gigs of memory. And it's open source, so you don't have a direct licensing binding between the service and the server it runs on. So it's a good candidate for virtualization.

And there are dedicated projects to automate the deployment of the services. That means the OpenStack control plane can be managed as code, and you can plug it into the CI/CD pipeline. So, given all that, OpenStack has been developed following modern application methodologies, and this is exactly what we were looking for.
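To make that RPC decoupling a bit more concrete, here is a minimal Python sketch of how a component typically talks over the message bus with oslo.messaging. The topic, method names and arguments are hypothetical; real OpenStack services wrap this in their own RPC API modules, so this is only an illustration of the idea, not any project's actual code.

```python
# Minimal oslo.messaging sketch: a blocking RPC call and a fire-and-forget cast.
# Topic, method names and arguments are hypothetical.
import oslo_messaging
from oslo_config import cfg

# The transport URL (e.g. rabbit://user:pass@vip:5672/) comes from the
# configuration, so it does not matter how many RabbitMQ nodes or how many
# consumers sit behind it.
transport = oslo_messaging.get_transport(cfg.CONF)
target = oslo_messaging.Target(topic='example-workers', version='1.0')
client = oslo_messaging.RPCClient(transport, target)

ctxt = {}  # request context; normally carries user/project information

# call(): send the request and block until one worker replies
result = client.call(ctxt, 'do_something', payload='hello')

# cast(): send the request and return immediately, no reply expected
client.cast(ctxt, 'do_something_async', payload='hello')
```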
Now, the HA model. As all of you know, the Pacemaker HA model is the one documented in the OpenStack High Availability Guide. It usually consists of a three-node bare-metal Pacemaker cluster with all the control plane services running on top. It's something that has been proven in production; you know that it works. But we found some limitations that we didn't want to carry into our architecture.

This deployment has all the services in the same cluster, so you cannot scale it as it is. You have RabbitMQ, you have Galera, you have MongoDB, and those services do replication; if you add more servers, you will have more replication overhead and less performance. And if you have all the services, dozens of them, running on the same node, the resource requirements get higher, right? If you translate those requirements into virtual hardware requirements, you end up with huge virtual machines, which are not a good fit for virtual environments because they are harder to operate; you don't want a virtual machine with the same number of virtual CPUs as your hypervisor, right?

The lifecycle of bare metal is also slow. Since this model is not a good fit for a virtual environment, you have to go to bare metal, and some companies need months to have hardware ready for consumption. If you need to scale out a controller node tomorrow, you won't get a new hardware rack connected to the network and so on within hours; that is not going to happen.

CI/CD also becomes more complex: you have to deal with Pacemaker concepts when you are iterating on individual components. You have to care about the constraints: I cannot just restart the Keystone service, because I have constraints and I'm going to restart Cinder and Nova as well. It can be automated, but it's harder than if you don't use Pacemaker.

And clustering software is stateful; that's its whole point. The cluster software takes care of the state of the members of the cluster and of the state of the services running on it. We were aiming for something more stateless, something that follows modern principles.

It also binds the control plane to the infrastructure. If you deploy it on bare metal, it clearly has a direct dependency on the underlying infrastructure. But even if you deploy it in virtual machines, you have these STONITH resources to restart the nodes in case of failure, and that requires configuring a user and a password for the virtual environment, direct access to the API of the virtual environment. So you have this direct binding; you cannot port the control plane to another virtual environment without changing configuration and so on.

So yeah, not the best HA model for what we were aiming for, and we looked for other options. Looking for alternatives, we found the so-called HAProxy/Keepalived model. Our colleague Javier Peña, who is an engineer at Red Hat, has documented this architecture in the link you can see there. This GitHub document is based on the belief that you don't need clustering software taking care of the OpenStack services; it's a Pacemaker-free architecture. Thanks to that you can distribute the components, because you don't have this glue that says "this is a cluster"; you are not orchestrating the startup of the services based on constraints. So you can split the control plane into different, smaller roles. And as you can split the control plane into smaller roles, virtualization makes more sense, because you can create small roles with small hardware resource requirements that fit well into any virtual environment. And of course it does not bind the application to the infrastructure, because you can virtualize it and there is no interaction between the application layer and the underlying hardware.

So, is it doable? We looked at it and tried to design it. What we aimed for was a distributed and virtualized control plane, meaning that we can pull the pieces apart, as Alberto explained, into individual components. We wanted to treat the individual components as horizontally scalable services wherever possible.
We will come back to that later. The foundation for iterating quickly is definitely a virtualized control plane, because there we can iterate way faster. It was also clear from the start that we wanted to isolate the shared state, meaning the MySQL database within a Galera cluster and RabbitMQ, on their own nodes.

For HA we leveraged the virtual machine HA capabilities of the virtual infrastructure: if the hypervisor where your virtual machine is running goes down, the virtual machine gets automatically restarted on a healthy hypervisor, so hardware failures are handled. We also configure anti-affinity rules to ensure that virtual machines belonging to the same role don't run on the same hypervisor, so if one hypervisor goes down, you don't lose a complete service.

Application-level HA is something that looks familiar to everyone, because it's more or less the same principle as Pacemaker, but without Pacemaker. What we don't have here is automatic recovery: if a service goes down, it will stay down until something goes there and starts the service again. That can be an operator, it can be automation code, or whatever.

For the web-based services, the APIs and the dashboard, we use HAProxy. HAProxy distributes the load and also monitors the backends, so if a backend stops responding to the health checks, it simply drops that server from the backend pool. We use Keepalived because we don't have clustering software handling the virtual IPs, so we need something to provide HA for the VIP. Keepalived is based on VRRP, a well-known network protocol for providing HA to virtual IPs, and it also monitors HAProxy: if HAProxy is stopped, Keepalived stops sending the multicast VRRP messages, so that node won't be eligible to become the master of the VIP.

For MySQL we use a Galera cluster. Galera is an active-active synchronous cluster, but as we all know, we cannot configure it active-active, because some OpenStack projects do SELECT ... FOR UPDATE, which creates locks in the database, so we have to configure it active-standby, using HAProxy in the middle. For MongoDB, a quick and easy replica set. For RabbitMQ, the native RabbitMQ clustering with mirrored queues. And for Redis: in the Pacemaker model Redis is handled by Pacemaker, but we don't have Pacemaker, so we are using Sentinel; if you look at the common HA configurations of Redis, Sentinel is always there. There is plenty of information about all of this upstream.
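To give an idea of what the application-level health checking can look like in practice, here is a minimal sketch of the kind of HTTP health endpoint HAProxy can poll (with `option httpchk`) to decide whether a Galera node is usable. The credentials, socket and port are hypothetical, and a real probe would be more elaborate; this only illustrates the principle.

```python
#!/usr/bin/env python3
"""Sketch of a Galera health probe for HAProxy's httpchk.

Returns HTTP 200 when the local node reports wsrep_local_state == 4 (Synced),
503 otherwise. Credentials and port are hypothetical.
"""
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

def node_is_synced():
    try:
        out = subprocess.check_output(
            ["mysql", "--user=clustercheck", "--password=secret", "-N", "-B",
             "-e", "SHOW STATUS LIKE 'wsrep_local_state'"],
            timeout=5).decode()
        # Output looks like: "wsrep_local_state\t4"
        return out.strip().split()[-1] == "4"
    except Exception:
        return False

class Probe(BaseHTTPRequestHandler):
    def do_GET(self):
        ok = node_is_synced()
        body = b"Galera node is synced\n" if ok else b"Galera node is NOT synced\n"
        self.send_response(200 if ok else 503)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 9200), Probe).serve_forever()
```

HAProxy would then mark any node answering 503 as down; combined with declaring all but one Galera backend as `backup`, this gives the active-standby behaviour described above.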
For the non-API components, as I said, this model is based on the belief that you don't need to provide clustering to the OpenStack services themselves: they coordinate themselves through RPC and through the state in the database.

If we then look at the individual components of the overall OpenStack, we can differentiate between two main things: the control plane and the compute resources. For the compute resources we went with a hyperconverged approach. We have hardware with quite a lot of disk as well as memory, in a high-density form factor. On every node we have 10-gig cards, where we do bonding for network HA towards the different switches. In general, within our data centers we go with a layer-3 spine-leaf design, meaning that we terminate layer 2 at the top of the rack and everything else is routed over the spine and leaf switches. The compute nodes are hyperconverged, but they also have plain local storage for the ephemeral storage cases, because we think most of the applications within Cloud Foundry are actually fine without any Cinder volume: they are already built distributed and horizontally scalable, so we can just spawn them quickly on local, cheap ephemeral storage.

For the control plane, when we first looked at it we wanted to separate many things into many networks, and it very quickly became very complicated. So we went with a simpler approach and just used one network, which is not totally true, because we have a separate network for the storage traffic, so that we can at least apply QoS more simply on storage, which becomes important if you go with a hyperconverged approach.

If you look at the services within the control plane, we grouped them more or less by major component. One of the decisions we also made: Horizon, for example, requires memcached for caching sessions and things like that, so we took the approach that helper or supporting services that exist only for one particular role are put onto that role itself; there is no need to separate them out as well. In general we started with really small virtual machines; I think every machine got more or less only two vCPUs and four gigs of memory, and we later tuned things up a little. That is also the nice thing about a virtualized control plane: you can scale horizontally, but you can also make minor adjustments vertically where it makes sense, depending on the kind of workload that is running on an individual node.
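Purely as an illustration of what grouping services into small roles and managing them as code can look like, here is a hypothetical sketch; the role names, service lists and sizes are invented for the example and are not our exact layout.

```python
# Hypothetical control-plane roles described as data: each role lists the
# services it runs and the (deliberately small) VM size it starts with.
ROLES = {
    "haproxy":  {"services": ["haproxy", "keepalived"], "vcpus": 2, "ram_gb": 4, "count": 2},
    "keystone": {"services": ["keystone-api"],          "vcpus": 2, "ram_gb": 4, "count": 2},
    "horizon":  {"services": ["horizon", "memcached"],  "vcpus": 2, "ram_gb": 4, "count": 2},
    "galera":   {"services": ["mariadb-galera"],        "vcpus": 2, "ram_gb": 4, "count": 3},
    "rabbitmq": {"services": ["rabbitmq-server"],       "vcpus": 2, "ram_gb": 4, "count": 3},
}

def scale_out(role, extra):
    """Scaling a stateless role horizontally is just 'count += extra'."""
    ROLES[role]["count"] += extra

def scale_up(role, vcpus=None, ram_gb=None):
    """Vertical tuning of a single role, e.g. giving the database more memory."""
    if vcpus:
        ROLES[role]["vcpus"] = vcpus
    if ram_gb:
        ROLES[role]["ram_gb"] = ram_gb

scale_out("keystone", 1)      # one more Keystone API VM
scale_up("galera", ram_gb=8)  # Galera nodes get more memory
print(sum(r["count"] for r in ROLES.values()), "control-plane VMs described as code")
```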
If we then look at the lifecycle: within our cloud environment we have a whole framework built around CI/CD. We have multiple stages; in one we develop things, with a huge rate of breakage, and the closer it gets to production, the more we only push the releases that actually proved to be stable and went through testing. This gives us real confidence to iterate quickly on new changes.

Also, and I think this is very important to note, we strongly believe that there should be a clean separation between code and configuration. The only things that differ between the various stages are the configuration for that particular stage, like DNS servers, NTP servers or the actual IP addresses. The code is packaged once, at the very beginning, as part of the CI process, and then we only push static artifacts through the different stages. Code never changes as it is pushed through the stages, because otherwise you would never know what you actually deployed; you wouldn't be able to compare the two.

For our automation we use Puppet. We also started to develop a Puppet deployment orchestrator, which adds more orchestration around Puppet. Nowadays a lot of people use Puppet for configuration management and Ansible for orchestration; our Puppet deployment orchestrator goes further, in that it also creates all the VMs within the virtualized environment and then takes care of their lifecycle. This gives us the possibility to describe a deployment of such a control plane as a piece of code, which can in turn be generated and fed from something similar to the tool AT&T has presented various times at this summit: a store of all the configuration data from which you can generate all the remaining code and even further configuration data. It also allows us to easily scale out purely through API calls.

For storage, as I mentioned, we're running hyperconverged compute nodes, and we're using ScaleIO from EMC underneath. ScaleIO is really nice because it scales with the number of disks that you put into it, and so also with the number of servers. For object storage, we're not doing object storage as part of OpenStack at Swisscom: we already have a huge Atmos installation which provides us object storage replicated over four data centers, so there was really no need to build up object storage services as part of OpenStack. Glance usually uses an object store, and if you want to have multiple Glance storage nodes, which is something you want in an HA environment, we again use Atmos, as an S3 backend. We push the images to Atmos once they are uploaded, and then we have caches on the Glance nodes, so that when people start launching VMs the images get cached locally on the node. But we can lose all of that, so we can just throw the VMs away.
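Coming back to the code-versus-configuration split for a moment, here is a tiny, purely hypothetical sketch of the idea of "same artifact, different per-stage configuration"; the file names and keys are invented for the example, and it assumes PyYAML is available.

```python
# Hypothetical sketch: the deployment artifact never changes between stages,
# only small files like config/dev.yaml, config/staging.yaml or
# config/prod.yaml differ. Keys and file names are invented; assumes PyYAML.
import sys
import yaml

def load_stage_config(stage: str) -> dict:
    with open(f"config/{stage}.yaml") as fh:
        return yaml.safe_load(fh)

if __name__ == "__main__":
    stage = sys.argv[1] if len(sys.argv) > 1 else "dev"
    cfg = load_stage_config(stage)
    # Only stage-specific data lives here: DNS, NTP, VIPs and so on.
    print("stage:", stage)
    print("dns_servers:", cfg["dns_servers"])
    print("ntp_server:", cfg["ntp_server"])
    print("api_vip:", cfg["api_vip"])
```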
As part of our stack we're also running an SDN; it's from PLUMgrid. That was a really important piece, given the scalability needs we saw with the myCloud use case, because they are really pushing a lot of data. So we needed distributed network services, so that the data plane really scales with the number of servers that we put in.

This is how it looks in the end. On the left-hand side we have supporting services like the Puppet master or the Puppet repository server. Then we have the control plane, which is part of our Arista layer-3 fabric; that's where all the VMs are attached: HAProxy, Nova, Keystone and so forth. All these VMs run on top of a Red Hat Enterprise Virtualization cluster, which has its own ScaleIO storage. So we actually have two virtualization clusters; RHEV is the more traditional virtualization cluster, with the features you know from VMware and the like. We could also have put it on another OpenStack, or on a VMware cluster; it doesn't really matter. It was what was easy for us to build and what made sense for us.

Then we have a separate network, the storage network, within the data plane, which connects all the storage services; Cinder, for example, requires access to that storage network so that it can talk to all the ScaleIO SDSs. Then we have the huge box, all the OpenStack compute nodes where the tenants are running, and the PLUMgrid gateways, which bridge between the virtualized world and the external physical world, meaning the outside internet or another department being routed. Access within the fabric is kind of separated, but given that we still want to access the OpenStack APIs from within tenants, we're bridging there as well, and we're using a PAN firewall to secure that access.

So now that we have an overview of our design: it was also quite a journey getting there, and Alberto will give you more insight into the various things we encountered, things that worked well and things where we saw that OpenStack services are not yet ready.

Yeah, the whole point is that we lacked field experience, so we weren't able to anticipate the problems and learned them the hard way. We designed it, we deployed it, we ran it, we failed, we went back to the design phase, deployed it again, failed again, and so on. That's the whole point of coming here and sharing this with you: so that you are able to anticipate these problems if you go for this architecture.

So let's start with the first one. We were aiming for a stateless architecture, an active-active architecture without any clustering software. Guess what? cinder-volume is not prepared for active-active configurations. So we had to decide: do we go to production without HA for this important component? Because this is the component that handles the lifecycle of the volumes, of the persistent storage. There is a commitment from the community to fix that.
So we thought, okay, let's create a small Pacemaker cluster only for this role, and we will get rid of it once this gets fixed. If you are curious and want to see what's going on, you can go to this link; it's documented by the person assigned to solve this issue. It's very comprehensive and you will get a quite good view of what's happening.

Then, bootstrapping clusters: after a complete failure, after the whole cluster goes down, these components require some manual intervention to come back online. We found that with the MariaDB Galera cluster: if the whole cluster goes down, you have to identify the node with the latest data, bootstrap the cluster with that node, and then join the other cluster members to it. With RabbitMQ something similar happens: you have to boot it in order. If the whole cluster goes down, you have to identify which node was the last one to go down, bootstrap the cluster with that node being the first one, and then join the other nodes. If you cannot identify the proper order, you have to reset the cluster by removing the Mnesia database, which is not recommendable, but it's the only way to recover it. And with MongoDB it happens that you get these fancy errors saying the cluster was shut down in an unclean way, and you have to execute a repair procedure to bring the cluster up again. Again, we don't have Pacemaker, we don't have something that automates all these steps, so you find this out the hard way, and you have to prepare your procedures, your automation code, for recovering from that. And this is what Marcel is going to talk about.

Yeah, so simply put, we don't have any magical recovery. It therefore becomes very important within such a horizontally distributed control plane that you actually monitor the health of your various components. We're using Consul for that. We also use Consul as part of the deployment and for service discovery, which helps us further orchestrate deployments: we can do things like having the first Galera node set up Galera and MariaDB while the other ones wait and then join the Galera cluster afterwards. So we can really orchestrate these things, and at the same time it gives us service discovery and service health. We then consume these events in an event processing engine, which is an own project that we started. It's called Orchard; at its core it's based on Riemann. It is able to see what kind of failures are going on within your system, and it can then actually go out and start remediation tasks. So we're looking into automating more of the recovery processes within Orchard, so that when Orchard detects patterns that we have seen before, it really goes in and fixes them automatically, without any operator doing it manually.
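As a concrete illustration of how a service and its health check can be registered with the local Consul agent, here is a minimal sketch using Consul's HTTP API; the service name, port and check URL are hypothetical, and in a real deployment this would come out of the automation rather than being done by hand.

```python
# Sketch: register a service plus an HTTP health check with the local
# Consul agent. Service name, port and check URL are hypothetical.
import json
import urllib.request

definition = {
    "Name": "keystone-api",
    "Port": 5000,
    "Check": {
        # Consul polls this URL; 2xx responses count as passing, anything
        # else eventually marks the service unhealthy, which monitoring and
        # remediation tooling can react to.
        "HTTP": "http://127.0.0.1:5000/v3",
        "Interval": "10s",
        "Timeout": "2s",
    },
}

req = urllib.request.Request(
    "http://127.0.0.1:8500/v1/agent/service/register",
    data=json.dumps(definition).encode(),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
with urllib.request.urlopen(req) as resp:
    print("registered, HTTP", resp.status)
```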
So maybe it's also time to look a little at the benefits and the drawbacks that we saw, as a summary of this overall architecture and our experience. The first one is that we achieved our goal. We have our cloud-like architecture: we can treat the control plane services as stateless applications, because we don't have any stateful component that prevents us from doing that, and we can operate them in a similar way to how we operate cloud workloads. We can forget about doing backups of the controllers, because that would only back up configuration, and the configuration is in the automation code. We have a dynamic and agile control plane: we can scale out, we can scale up, we can do whatever we want, because we are running virtual machines and we are managing the control plane as code; we just need to call an API to deploy a new virtual machine.

It's also a cost-effective solution. Everyone knows that virtual environments are the best way to save resources, because you have this fancy thing called overcommitment: you can configure more resources than are actually available in the underlying physical servers. And the control plane does not depend on the infrastructure, so you can do lifecycle operations on the hypervisors where the control plane is running without impacting the control plane. If you don't want to continue with your current virtualization vendor, you can even move these workloads to a new virtual environment, because that is one of the great things about virtual machines.

Yeah, and we especially saw a lot of opportunities and actual benefits for day-to-day operations. Now that we have this distributed architecture, we can treat the individual components separately and scale them separately. We also started with a really minimal set of services and are now starting to onboard new ones; onboarding a new OpenStack service into the cloud just means deploying new roles, and you don't need to worry about integrating them into an existing deployment. We're also thinking that, although our upgrade path might end up looking different, you could actually deploy a whole set of a newer version of the control plane and then just move things over; with a bare-metal approach this would require new resources, while in a virtualized environment you can shift things over to the new VMs and then take the old ones down.

As Alberto already mentioned, the only things we really back up are the user data, so more or less the database dumps; all the other things just come out of automation. And if we have nodes that are in a weird state, where you don't know what happened to them, what people might have run on them to debug things, and you're unsure whether the VM is still clean, you just throw them away and restage them. This can actually happen without any interruption of the service itself, because we have HAProxy in front, which we can drain; then we take the node down, redeploy it, and once it's up again HAProxy will just pick it up again.
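To give an idea of what draining a node behind HAProxy before restaging it can look like, here is a small sketch that talks to the HAProxy admin socket; the socket path and the backend/server names are hypothetical, and the admin socket has to be enabled ("stats socket ... level admin") in the HAProxy configuration for this to work.

```python
# Sketch: take one backend server out of rotation via the HAProxy admin
# socket before restaging the VM, then put it back afterwards.
# Socket path and backend/server names are hypothetical.
import socket

HAPROXY_SOCKET = "/var/run/haproxy.sock"

def haproxy_cmd(command: str) -> str:
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(HAPROXY_SOCKET)
        sock.sendall(command.encode() + b"\n")
        return sock.recv(65536).decode()

# Stop sending new connections to this node...
print(haproxy_cmd("disable server keystone_backend/keystone-01"))
# ... restage/redeploy the VM here ...
# and once it is healthy again, re-enable it:
print(haproxy_cmd("enable server keystone_backend/keystone-01"))
```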
Now the drawbacks. As we have seen, cinder-volume is not active-active ready, so we have our Pacemaker cluster running there; hopefully we will be able to remove it soon. The Galera cluster is the same story: we have the SELECT ... FOR UPDATE statements that create locks in the database, so you have to configure it active-standby, using HAProxy.

Also, MariaDB and RabbitMQ don't scale horizontally, and the same happens with cinder-volume: since it's active-passive, you can deploy a hundred nodes and still have only one active. With MariaDB and RabbitMQ, if you add more nodes you get more replication overhead and less performance, and in the case of MariaDB it is pointless anyway: you are only ever talking to one database, due to the HAProxy active-standby model.

Then there is the "no magical recovery" issue. We don't have anything that takes care of the services, so if the Cinder API goes down, it will stay down. You need a monitoring system that sees this and sends you an alarm, and then automation code that brings the service up again, or trained operators, or whatever. The same happens with the bootstrapping of clusters: if the whole Galera cluster goes down, you need something to bring it back up.

Network partitions and Keepalived: Keepalived is based on VRRP, which uses multicast messages to coordinate the VIPs across the nodes. So if node A loses connection with node B, and node A is the active one, node B will stop seeing the messages from node A, will say "I'm the master", and you end up in a split-brain scenario. You can mitigate this risk by providing HA in the network infrastructure, so bonding and so on.

Something I forgot to mention is that we use round-robin DNS to distribute the load across the different HAProxies, because otherwise you have only one VIP and you are only talking to one HAProxy. With round-robin DNS you can distribute the load across the HAProxies. But with Horizon you cannot, because you don't want the user to log in every time they talk to Horizon; you have to create sticky sessions based on cookies, so that service cannot benefit from the active-active model with round-robin DNS.
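As a small illustration of the round-robin DNS idea, here is a sketch that resolves all the A records behind one API hostname (the name is invented for the example) and picks one HAProxy endpoint per request.

```python
# Sketch: a client resolving a round-robin DNS name that has one A record
# per HAProxy VIP, then picking one endpoint. Hostname is hypothetical.
import random
import socket

def haproxy_endpoints(name: str, port: int = 5000):
    # getaddrinfo returns every A record published for the name
    infos = socket.getaddrinfo(name, port, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})

endpoints = haproxy_endpoints("api.cloud.example.com")
print("HAProxy frontends:", endpoints)
print("using:", random.choice(endpoints))
```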
If we look a little bit at future work, we would like to start with a call-out to all the OpenStack developers, something we would really ask you to keep in mind: we're building a cloud, so we should also build the services in a cloud-native style. It would be good to have services built active-active from the beginning, because as the discussions around cinder-volume show, it becomes more and more painful once a service is more and more established.

Another thing we saw, and built ourselves for HAProxy, is a health probe for Galera on the backend. Having health endpoints where you can ask each service "are you healthy or not?" would be really helpful for integrating more deeply with HAProxy, because otherwise it's kind of unclear what you should check. HTTP return code 200? As soon as a 500 gets fired back it's a failure and we take the node out, but a 200 sometimes just means the service sends you the list of service endpoints.

When we started with this architecture, taking all the pieces apart, we could also have modelled things with containers. We started with this approach one and a half years ago, and at that time containers were not yet that mature, and especially container scheduling was not mature enough yet. But we definitely saw at this summit that a lot is happening in that area: the Kolla project has evolved a lot, there are now schedulers around like Kubernetes, and we saw many presentations showing how they work together. So this is something we are considering for the future: packaging the services not within individual VMs but in containers, so that we would even get rid of the overhead of a VM. This is something we will definitely investigate.

Something else is the usual discussion about whether to go with hyperconverged storage or not. At the moment we're not seeing any bottlenecks in our setup when it comes to the hyperconverged storage; if we talk about performance, we see other challenges. The usual arguments against it, performance impact or memory exhaustion on the compute nodes, are not a problem for us at the moment, so it's not that high on the priority list to reconsider that decision, but it might be something that comes up.

Yeah, and that's it from our side. Thanks a lot for listening, and I guess we might still have some time for questions. Otherwise, please come up and you can ask us.