Welcome to this OpenStack on a Silver Platter session. My name is Frédéric Lepied, I'm VP of Software Engineering at eNovance. Thanks for coming. And I'm Emilien Macchi, Software Engineer at eNovance. Here is the agenda for the session. We'll start by discussing the reference architecture we use to deploy OpenStack. We will show you how we deploy OpenStack and which choices we made to get scalable deployments. We'll go through the bare metal part, the hardware detection and validation, then the step-by-step configuration using Puppet, and the sanity tests that validate everything is okay after deployment. And finally, the most important part: how to upgrade your platform.

Okay, so let's talk a little bit about the reference architecture. You can see in the picture that it's a classic architecture where you split out the services. Let's look at which components we deploy. At the top, we deploy load balancers with HAProxy and Keepalived. Then we have the OpenStack controllers running the OpenStack APIs and schedulers; we also run MySQL with Galera Cluster, RabbitMQ, and MongoDB there. For compute, we use the KVM hypervisor. For the network, we deploy the ML2 plugin with Open vSwitch. And as the storage backend for Glance, Cinder, and Nova, we deploy a Ceph cluster.

So let's start with the deployment story, beginning with how we do bare metal deployment. We have taken an unusual approach: we use images to deploy bare metal systems. The idea is to have a scalable way of deploying a lot of systems at once, without repeating the same installation of packages and configuration of users on every machine. We prepare everything in advance and download everything onto each system in a single operation. To do this, we prepare our images ahead of time, and we have built a system that is Linux distribution agnostic: it runs on top of Red Hat Enterprise Linux, CentOS, Debian, and Ubuntu. The idea is to build these images once and to be able to reproduce the installation at will. That's very important. This way, we can install a system one day, and if a customer has trouble one year later, we are still able to reinstall the system exactly the same way and debug the issue without any installation surprises.

The philosophy of these images is to build them with all the needed bits but with nothing configured. This means no service is activated: when you boot the system, nothing starts. We then let a configuration management system like Puppet configure the services, start the right ones, and write the files each service needs.

For OpenStack itself, we have chosen to build only two types of images: one for the installation server, and one for OpenStack with all the services. At the beginning, we were designing a lot of different images, but the combinatorics were very difficult to cope with, because sometimes you want one service on one system and other services on another. For example, sometimes you want storage and compute on the same node, and sometimes you want them separated for performance reasons. In the end we decided it was too difficult to manage, and we put everything in a single OpenStack image. All the services are installed from packages, then we deactivate all of them at boot, and we let Puppet do the job at configuration time. (The sketch below illustrates the idea.)
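As a concrete illustration of that "install everything, activate nothing" philosophy, here is a minimal Python sketch of disabling every service in a freshly built image tree. This is not the actual eDeploy build code: the image path is made up, and it assumes a systemd-based image.

```python
"""Sketch: deactivate every service in an image tree at build time.
The image root path is hypothetical; real eDeploy builds have their
own layout and support non-systemd distributions as well."""
import subprocess

IMAGE_ROOT = "/var/lib/edeploy/build/openstack-full"  # hypothetical

# Ask systemd, inside the image, which services the packages enabled...
units = subprocess.check_output(
    ["chroot", IMAGE_ROOT, "systemctl", "list-unit-files",
     "--type=service", "--no-legend"]).decode()

# ...and disable every one of them, so a freshly booted node does
# nothing until Puppet explicitly enables what its role needs.
for line in units.splitlines():
    parts = line.split()
    if len(parts) >= 2 and parts[1] == "enabled":
        subprocess.check_call(
            ["chroot", IMAGE_ROOT, "systemctl", "disable", parts[0]])
```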
And the installation server, what is it exactly? It just runs the Puppet master and the bare metal provisioner, which is simply PXE and HTTP, with iPXE if needed. It also acts as the upgrade server, relying on rsync, but we will explain that later. And we use Jenkins as our way to gather logs and show the status of the installation; we will come back to this later too.

Let me explain how the bare metal deployment works in a bit more detail. When we boot a system, it needs to be configured in the factory to boot on PXE. It first boots and requests a PXE image, and we send the usual kernel with our special RAM disk containing the installer. The installer is loaded on the system, detects the hardware, and sends the list of detected hardware back to the installation server. The installation server has a set of rules saying, for example: if it has this much memory, it is a compute node, so it gets the compute-node configuration; if it has a lot of disks, it gets the storage-node configuration. With this kind of rule we can send a different configuration according to the hardware (the first sketch below gives the flavor of such rules).

The installation server then sends back a configuration script. The configuration script is only there to configure the networking, configure the RAID, partition the disks, and then download the image and write it directly onto the newly created partitions. It then sets up the bootloader and reboots the system. At first boot, the system runs cloud-init and provisions the SSH key, so the installation server can access it later. The whole process takes something like six or seven minutes. It's very fast and completely distributed, so we can take a lot of systems from bare metal to complete systems ready to be configured by Puppet.

We also have a tool called the hardware health check, which validates that the hardware is ready to accept an OpenStack configuration. When you deploy a large set of systems, you usually get some disk failures, some misconfigurations, or some cabling problems. I'm presenting it after eDeploy because it uses exactly the same infrastructure, but it usually runs before the installation: instead of loading the installer RAM disk, we load a benchmarking RAM disk that runs a lot of benchmarks on the network, the disks, the memory, and the CPU. We use standard benchmark tools like fio, sysbench, and netperf, together with statistics, to verify that the hardware is correct and performs as expected (the second sketch below shows the statistics idea). It's very useful for detecting non-functioning hardware in a group of systems.
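To make those matching rules concrete, here is a hedged Python sketch of the idea. The role names, thresholds, and fact keys are invented for the example; eDeploy's real specification files have their own syntax.

```python
"""Sketch: classify a detected node by rules, in the spirit of
eDeploy's hardware matching. All thresholds and keys are illustrative."""

def classify(facts):
    """Return a role name for the hardware facts the installer reported."""
    disks = facts.get("disks", [])
    ram_gb = facts.get("ram_gb", 0)

    # Lots of spindles: treat it as a storage (Ceph) node.
    if len(disks) >= 6:
        return "storage"
    # Lots of memory but few disks: a compute node.
    if ram_gb >= 128:
        return "compute"
    # Everything else becomes a controller.
    return "controller"

# Example facts, as the installer RAM disk might report them:
node = {"ram_gb": 256, "disks": ["sda", "sdb"]}
print(classify(node))  # -> "compute"
```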
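And the statistics side of the health check can be pictured like this: one benchmark figure per node, and any node far from the pack gets flagged. Again, this is only a sketch of the idea, with made-up numbers and a made-up 2-sigma threshold.

```python
"""Sketch: flag nodes whose benchmark result deviates from the group,
the way the health check spots a failing disk in a batch."""
import statistics

def outliers(results, sigmas=2.0):
    """results: {hostname: MB/s}; return hosts more than `sigmas` away."""
    values = list(results.values())
    mean = statistics.mean(values)
    dev = statistics.pstdev(values)
    return [host for host, v in results.items()
            if dev and abs(v - mean) > sigmas * dev]

# e.g. disk throughput measured by fio on each node of the batch:
disk_mbps = {"node1": 412, "node2": 398, "node3": 405, "node4": 97}
print(outliers(disk_mbps))  # -> ['node4']  (probably a failing disk)
```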
So once your servers have booted and the hardware is validated, you want to configure the OpenStack services, and we use Puppet for this. When we started to deploy OpenStack, we had trouble with Puppet because we could not get a clean orchestration: we wanted to first deploy, let's say, the database, then Keystone, and then run the OpenStack compute services. So we started working on step-by-step deployments. To do this, we use the upstream Puppet modules that you can download from Stackforge.

We also built an OpenStack module, which is available on the eNovance GitHub. It's a high-level Puppet module that configures OpenStack in a flexible way, and all the services are highly available. The module is very flexible, with multiple backends supported: you don't have to install Ceph if you don't need it; you can use the NetApp driver, or iSCSI, or whatever. The Puppet module is entirely unit tested, and we build the Puppet configuration up during each step.

The next slide shows the workflow of the deployment. To validate each step, we use a framework called Serverspec. Serverspec is an integration testing framework: you can write tests like checking that a service is running, or that a service is listening on the right port, and so on. We use Serverspec to validate that Puppet did the job we expected. So basically, when you add a feature to the deployment, you also want an integration test in Serverspec that validates Puppet actually did what you expected.

The next slide shows the workflow of the step-by-step deployment. At each step, we run Puppet once, and then we run Serverspec to see which tests pass. If some tests fail, we just run Puppet again, and again until it works, with a limit of 5 runs. If after 5 runs the Serverspec tests still fail, we consider that we have a failure somewhere, and with Serverspec it's very easy to debug what failed. Let's say, for example, Keystone is not running during the deployment: you won't continue the deployment to the end; you will stop at step 3, where Keystone is installed by Puppet, and you can dig directly into the bug and fix it. (The sketch below shows this retry loop.)
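The retry logic itself is simple; here is a minimal Python sketch of the "run Puppet, check with Serverspec, give up after 5 runs" loop. The per-step rake task names are an assumption, not the real deployment scripts.

```python
"""Sketch of the step-by-step convergence loop: run Puppet, validate
the step with Serverspec, retry up to 5 times before failing."""
import subprocess

MAX_RUNS = 5  # after 5 Puppet runs, the step is declared broken

def run_step(step):
    """Apply one deployment step and verify it converged."""
    for attempt in range(1, MAX_RUNS + 1):
        # Apply the configuration for this step (exit code 2 just
        # means "changes were applied", so we don't check it).
        subprocess.call(["puppet", "agent", "--test"])
        # Validate with the per-step Serverspec suite
        # (the rake task name is hypothetical).
        if subprocess.call(["rake", "serverspec:step%d" % step]) == 0:
            print("step %d converged after %d Puppet run(s)" % (step, attempt))
            return
    raise RuntimeError("step %d still failing after %d runs" % (step, MAX_RUNS))

for step in range(1, 6):  # e.g. 1=database, 2=messaging, 3=Keystone, ...
    run_step(step)
```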
Each step also produces a report, which means you can do some statistics on the deployment: you can see how many Puppet runs each step needed, and how much time the tests consume, so you can sanity-check the deployment itself; if it takes too long, maybe something is wrong in the configuration. I think we will keep this approach. We didn't have this idea at the beginning, it was more of a random idea, but now we are planning to keep this model for the future, because a step-by-step deployment saves us a lot of time when we have to debug deployment issues.

Once you have OpenStack and Ceph deployed on your platform, you want to validate that all the services are up and running, and to do that we simply use the Tempest framework. Tempest provides functional tests that validate the APIs and the CLI services, about 1,600 functional tests. We also use an interesting part of Tempest called Javelin. Javelin is a tool for validating upgrades: you create some resources before the upgrade, and you want to validate that those resources survive the upgrade process; Javelin is able to do that. We also re-run the hardware validation, to verify that we don't lose performance along the way. And we have one last feature in the sanity process: we validate that we can install OpenStack in OpenStack directly, so we don't have to boot physical servers; we can just use a Heat template and boot the whole platform inside OpenStack.

To drive the whole process, we don't have a nice custom UI; we just use Jenkins, with five jobs doing everything. The first one is the hardware validation with eDeploy. Then, if the hardware is validated, we bootstrap the different servers by installing the hosts. After this, we configure the nodes using Puppet, and we run the sanity process to validate that OpenStack is up and running. At the end, we test the upgrade process from the last release to the new one and validate that everything still works after the upgrade: we run sanity again to check that we didn't lose any features.

Now, OpenStack upgrade management. Who in the room does upgrades, who knows how to do it? Only a few hands. Okay, so it's a difficult topic in OpenStack. That is also one of the reasons to go the image way of doing deployments: we are able to upgrade a system using images. This way we can ensure that whether we do an upgrade or a fresh installation, we will have the same system at the end. That's very important for us, because with packages, depending on the order in which you install them, you can end up with different results on your system. With images we are sure the result is predictable and reproducible.

The other very important part is that we are able to upgrade a large set of systems without downloading package lists and computing the differences with the package manager on every system. That would be a waste of time: a lot of compute nodes will all have the same list of packages, so we don't need to compute the same delta multiple times. So what we do is use rsync to synchronize an image from the installation server to the different target systems (there is a sketch of this just below). And we don't do this randomly, because that would end up in a mess: we orchestrate all of it using Ansible playbooks, and we'll explain in detail how. Then we configure whatever is needed after the upgrade, still using Puppet; Ansible is just there to do the orchestration. And of course, after the upgrade, we validate everything using the same tests we use for installations.
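To picture the image-based upgrade, here is a hedged Python sketch of the rsync step. This is not eDeploy's real upgrade code: the image path, node names, and exclude list are all assumptions.

```python
"""Sketch: push an image delta to many nodes with rsync, which is how
an image-based upgrade avoids per-node package resolution. Paths and
host names are illustrative, not eDeploy's real layout."""
import subprocess

IMAGE = "/var/lib/edeploy/openstack-D7-I.1.2.1"  # hypothetical image tree
NODES = ["controller1", "controller2", "compute1", "compute2"]

for node in NODES:
    # --delete makes the target an exact copy of the new image, so an
    # upgraded node ends up identical to a freshly installed one.
    # Runtime state and locally managed config would be excluded.
    subprocess.check_call([
        "rsync", "-a", "--delete",
        "--exclude", "/etc/puppet",   # keep Puppet-managed config
        IMAGE + "/", "root@%s:/" % node,
    ])
```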
One more thing that is very important to understand: we serialize the updates, because otherwise there would be a combinatorial explosion of upgrade paths. We only support going from N to N+1, or from N+1 to N+2, one step at a time, and the Ansible playbooks are designed for each specific upgrade; we'll explain why. We are going to show you two use cases of upgrading a platform.

The first use case is a minor upgrade, where there isn't much OpenStack to upgrade. It typically happens when you have, as here, a MySQL upgrade, which is pretty basic compared to an OpenStack upgrade. The first step is to prepare the upgrade: you look at the image build difference between the two versions, and eDeploy is able to tell you which packages are upgraded in the next release. We basically design the Ansible playbooks after reading that package diff. In this use case we only have to restart MySQL, and since we are running a cluster, we don't have to take care of a backup or anything; we just have to restart the service. If you look at the execution, again, it's pretty basic: an Ansible playbook stops MySQL, upgrades the node using eDeploy (as Frédéric said, the upgrade is very fast with rsync, since it only syncs the upgraded packages from the master node), and after the upgrade, which takes maybe a few seconds, it starts MySQL again. There is no downtime for this kind of use case. After the Ansible work, we run Puppet once, and we validate all the steps again to check that Puppet did the job we wanted in the new release. Then we validate the upgrade by running the sanity job I showed you before, with Tempest and Javelin checking the resources created before the upgrade. After this step we can say the upgrade is validated, and we can move on and make a release.

The next use case is more interesting if you care about OpenStack upgrades. This one is about upgrading from Icehouse to Juno, and there is also a kernel upgrade, so you can guess that we have to reboot the nodes: since you are upgrading the kernel along with OpenStack, you want to reboot to be sure you are running the latest kernel. You start the same way as before, by looking at the package difference between the two versions, and then comes the most difficult part of the process: thinking about how to upgrade the platform without downtime. You have to write a smart orchestration with Ansible. If you look at the Ansible part of this major upgrade use case, you will see that we again stop the services, run the eDeploy upgrade, reboot the nodes, and wait for the reboot. All the services are HA and managed by HAProxy or Pacemaker, so we don't have to manage those parts: the services are automatically added back to the cluster when the nodes reboot. For the compute nodes, you may wonder whether we lose VM connectivity, and the answer is no, because Ansible takes care of live migration before the upgrade: to upgrade a compute node, you just evacuate all its instances to other compute nodes and then do the upgrade (a sketch of this rolling pattern follows below). Sometimes during the upgrade, depending on the OpenStack version, you may have some downtime for some services, but if you write it with Ansible and manage the configuration with Puppet, everything is automated and pretty fast compared to a manual process.
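Here is a rough Python sketch of that rolling compute-node pattern. In reality this is an Ansible playbook run serially; the node names and the `edeploy upgrade` invocation are assumptions, while `nova host-evacuate-live` is the standard nova CLI command for live-migrating every instance off a host.

```python
"""Sketch of the serialized compute-node upgrade: evacuate the VMs,
upgrade the node, reboot it, wait for it to come back, move on."""
import subprocess, time

COMPUTES = ["compute1", "compute2", "compute3"]  # illustrative inventory

def ssh(host, *cmd):
    subprocess.check_call(["ssh", "root@" + host] + list(cmd))

for node in COMPUTES:  # strictly one node at a time: a rolling upgrade
    # 1. Live-migrate every instance off the node before touching it.
    subprocess.check_call(["nova", "host-evacuate-live", node])
    # 2. Sync the new image onto the node (this command stands in for
    #    the rsync-based eDeploy upgrade sketched earlier).
    ssh(node, "edeploy", "upgrade", "D7-I.1.3.0")
    # 3. Reboot to pick up the new kernel (the connection drops, so we
    #    don't check the exit code)...
    subprocess.call(["ssh", "root@" + node, "reboot"])
    time.sleep(60)
    # 4. ...and wait until the node answers again before moving on.
    while subprocess.call(["ssh", "root@" + node, "true"]) != 0:
        time.sleep(10)
```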
After the Ansible work, again, we run Puppet and we validate all the steps, all the new functional tests. We also run sanity to validate the new features in OpenStack: since we are running a new version, there are new functional tests in Tempest, so we are able to check, first, that the resources created before the upgrade are still there, and then that the new OpenStack features are working. All of this is integrated into the Tempest process.

You may wonder how you can reproduce this yourself. Everything is open, and you can find all the code. The Puppet modules are located in Stackforge, and we have some custom repositories on the eNovance GitHub. eDeploy is the biggest one, and it's the project to look at for the bootstrap. We also share the eDeploy roles scripts that build the images, and there is the puppet-openstack-cloud module, the high-level Puppet module you may want to use to deploy OpenStack. The two other repositories are related to configuration management. So you can go ahead and test it yourself; we have documentation too, and any feedback is welcome. Thank you very much. If you have any questions... do we have a mic? Do we have a mic for the questions? Yes, we have. Thanks, we'll pass it around.

Why Ansible, when you already have Puppet?

Ansible does the orchestration: it runs the set of tasks we want to orchestrate, outside of Puppet. Puppet is there to manage the configuration and the services, stopping or starting them. When the Puppet run starts, Puppet manages the Nova configuration, the Neutron configuration and so on, but Ansible is only there to manage the upgrade process: I want to stop this service, then upgrade the node, then start the service again. If I want to update the configuration, that is done by Puppet. We chose Ansible because the orchestration you have in Puppet today is not exactly what we want, and we found that Ansible was exactly what we needed. Generally, Puppet is better for orchestration on a local system, and Ansible is better for orchestrating across a set of systems, like a rolling upgrade: this system first, that one after, and so on.

Puppet comes with MCollective; have you looked at MCollective? It ships with Puppet and it's an orchestrator, so why did you choose Ansible and not MCollective?

So the question is why we use Ansible and not MCollective. Because MCollective is more for running a distributed task, the same task on all the nodes at once. It's not an orchestrator in the sense of running one task, waiting for it to finish, and then running the next one on another node; that's not what MCollective is designed for, it's more distributed. That's why we use Ansible rather than MCollective. With MCollective you could simulate it, you could build it on top of MCollective, I agree with you, but with Ansible it's native, it's already designed for that. Okay, thanks.

I have a question: why use eDeploy rather than existing tools like Razor or Ironic? I saw a slide of your workflow, and I think there are existing tools, so I'm just wondering why you chose to implement your own.

Yes, Razor is the closest. The thing is, what we designed first was the hardware detection and validation part, and that wasn't available in the other tools at the time. So we designed that tool first, and then we said, this is nice, we can use the same thing to do the deployments. That's why we built it and we stick
with it: it's the same tool for all the different steps. But this gentleman has been waiting longer.

What tools do you use to manage the underlying databases, like MongoDB and MySQL?

Can you repeat?

Oh, sorry: what tools do you use to manage the databases, like MongoDB and MySQL?

For MySQL we use Galera with the MariaDB packages, and for MongoDB we use the packages as well.

And which Puppet module do you use to manage Galera?

Yes, the Puppet module; the Puppet Labs one.

Okay. I think when I looked, they didn't have it yet. Thanks.

I'm just curious what you do for RAID and BIOS when you do bare metal, the RAID and BIOS configuration. Do you assume it's all ready?

During the hardware detection we actually look at all the BIOS information. A lot of things coming from the hardware, the BIOS version, even the firmware settings: everything is inspected, so we can detect a misconfiguration if the set of machines is not homogeneous, and during the hardware validation we can decide to stop. (There is a sketch of this homogeneity idea below.)

Can you also do things like JBOD or RAID 10, make those decisions and apply them automatically?

Yes, during the configuration. The scripts are in fact just Python scripts, and we can do whatever we want: managing the RAID, setting up JBOD, or whatever needs to be done.

How do you deal with BIOS? Do you expect the same hardware everywhere, say Dell versus HP? How do you deal with BIOS management, how do you update the BIOS?

For now we cannot manage BIOS updates themselves; that's something we still have to work on, and today we rely on the hardware vendors' tools. But we do have the tools to validate that everything is at the same level. Updating the BIOS is possible if the vendor provides an application for it; for HP, we use the HP server application. It's feasible because it's only Python scripts that are deployed to configure the hardware.

Okay, thank you very much.
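To make the homogeneity check from that last answer concrete, here is a hedged Python sketch; the fact layout and the node data are invented for illustration.

```python
"""Sketch: detect non-homogeneous BIOS/firmware across a batch of
nodes, in the spirit of the hardware validation described above."""
from collections import Counter

def non_homogeneous(nodes, key):
    """nodes: {hostname: facts}; flag hosts whose `key` differs
    from the majority value across the batch."""
    values = Counter(facts[key] for facts in nodes.values())
    expected, _ = values.most_common(1)[0]
    return {h: f[key] for h, f in nodes.items() if f[key] != expected}

nodes = {
    "node1": {"bios_version": "2.1.7"},
    "node2": {"bios_version": "2.1.7"},
    "node3": {"bios_version": "1.9.3"},  # missed a firmware update?
}
print(non_homogeneous(nodes, "bios_version"))  # -> {'node3': '1.9.3'}
```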