Hello, everyone. My name is Tomasz Paszkowski, and I'm a Cloud Solution Engineer at Intel. Today's talk is OpenStack on Kubernetes, one year after. The agenda is pretty simple: first we will look at the motivation behind this initiative, then the current landscape and how it evolved during the last year, and there will be a short section for Q&A.

I was part of the team responsible for preparing the OpenStack on Kubernetes demo for the keynote session at the OpenStack Summit in Austin in 2016. It was well received and very successful. But the main goal was that we simply wanted to make things easier. By making things easier, we mean that OpenStack lifecycle management was quite complex. OpenStack is simply just an application, but a really complex one. Putting it on top of Kubernetes helps orchestrate its deployment, but it also simplifies day-two maintenance. The OpenStack community put a lot of effort into making rolling upgrades of OpenStack itself possible, but there was still no tool that could orchestrate those upgrades, and we wanted Kubernetes to become that tool. Kubernetes also offers a modern approach to HA: everyone who has used both Pacemaker and Kubernetes knows that Kubernetes is far more advanced and far more user-friendly than Pacemaker.

The other motivation was that we wanted to offer data center operators and users a platform that helps them maximize their infrastructure utilization. The advantage public cloud operators have is that they have put significant R&D into extracting as much as they can from their bare-metal resources, and this is simply not possible for smaller companies. With an approach where we can put containers and VMs on the same set of machines, we can push that utilization somewhat higher. We also wanted OpenStack to become a first-class citizen on top of Kubernetes, and vice versa. As we have seen at this summit, Kubernetes is a very popular application deployed on top of OpenStack, as OpenStack offers a lot of advantages.

This is more or less how part of an OpenStack deployment on top of Kubernetes looked a year ago, and today it looks the same: still a couple of pods, some Kubernetes controller, a Kubernetes Service, and an Ingress controller in front; a minimal sketch of that shape follows below.

What has changed over the last year is that the ecosystem is much richer. We have multiple solutions. One of them is FuelCCP, which is the only one offering a full rolling upgrade: with FuelCCP you can migrate from Mitaka to Newton with no downtime on the data plane and only a very small downtime on the control plane. There is also Kolla-Kubernetes. There are two OpenStack-Helm projects, one of which is vendor-driven by SAP. And there is Stackanetes. All those projects differ somewhat. There are also projects strictly focused on delivering just container images. There is Kolla, which was the first and is the richest one so far. There is LOCI, which was recently introduced and seems to be quite successful, as it offers really simple, lightweight containers. You can also use the containers built by FuelCCP, although you need their own specific tooling to take advantage of those images. And each of those solutions uses different tools to deploy containers to Kubernetes and to template configuration files.
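To make that "couple of pods" shape concrete, here is a minimal, illustrative sketch of one OpenStack API service expressed as a Deployment plus a Service; it is not taken from any one of the projects above, and the image name and port are placeholder assumptions:

```yaml
# Hypothetical example: one OpenStack API service on Kubernetes.
apiVersion: apps/v1beta1        # extensions/v1beta1 on older clusters
kind: Deployment
metadata:
  name: keystone-api
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: keystone-api
    spec:
      containers:
      - name: keystone-api
        image: example.registry/keystone:newton   # placeholder image
        ports:
        - containerPort: 5000
---
apiVersion: v1
kind: Service
metadata:
  name: keystone-api
spec:
  selector:
    app: keystone-api
  ports:
  - port: 5000
```

An Ingress object in front of the Service would then expose the API externally; the essential point is how small the Kubernetes-side footprint of one service is.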
Stackanetes was using KPM. FuelCCP has its own custom tooling written in Python, which takes advantage of Jinja2 templates. Kolla-Kubernetes went for a somewhat hybrid approach: they use Ansible with Jinja2 to template configuration files, and then call Helm to deploy containers to Kubernetes. And then there is the OpenStack-Helm project, recently accepted into the OpenStack family, which is entirely Helm-based and uses Go templating for configuration files. So all those projects differ somewhat, but in the end the goal is the same: put OpenStack on top of Kubernetes and try to solve the day-two maintenance issues by orchestrating rolling upgrades for OpenStack.

How did all those solutions evolve during the last year? The evolution was driven not so much by OpenStack development as by how Kubernetes itself developed. When we started to work on Stackanetes, the stable release of Kubernetes was 1.2; right now we have 1.7, so a lot changed in the meantime.

At first we used DaemonSets for pods where we wanted to make sure that exactly one pod runs per machine. This is the case for nova-compute, libvirt, Open vSwitch, the Neutron agents, and so on. But DaemonSets at that time had a significant disadvantage: they simply did not offer rolling upgrades; it was not possible to do a rolling upgrade of a DaemonSet. And one of the reasons we put OpenStack on Kubernetes in the first place was so that we could simply upgrade OpenStack. This is finally fixed in Kubernetes 1.7, but it took the Kubernetes community some time. The other disadvantage of DaemonSets is that they do not support drain, by design; why this matters for an OpenStack on Kubernetes deployment I will explain a few slides later. A DaemonSet also requires you to label a machine if you want its pod deployed there, assuming you want to limit which machines are used for the control plane and which machines host the VMs.

With Deployments, the situation was slightly different. Deployments offered rolling upgrades from the very beginning, so it was possible to upgrade your application using native Kubernetes features. But until Kubernetes 1.4 it was not really possible to say that each pod of a controller should land on a separate machine. Say you run the Nova API service with three replicas; you want each of those instances to land on a separate machine, because that is how it should work in a data center. Before Kubernetes 1.4 that was not possible, but 1.4 introduced node affinity and pod anti-affinity. So I can simply tell Kubernetes that for this controller, each pod should land on a different node; there will be no two pods from the same controller on the same node. Another nice feature that landed in Kubernetes 1.4, also based on node affinity, is that I no longer need labels to select which machines to use for the control plane: I can simply put machine names into the manifest using match expressions on the hostname label. As you can see in the example, I can say that these specific pods should land on these three machines.
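A hedged sketch of those two 1.4-era scheduling features: pinning control-plane pods to named machines with node affinity, and keeping replicas of the same controller on distinct nodes with pod anti-affinity. The hostnames, labels, and image are placeholder assumptions:

```yaml
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: nova-api
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: nova-api
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values: [ctl01, ctl02, ctl03]   # the three control-plane machines
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: nova-api
            topologyKey: kubernetes.io/hostname  # at most one nova-api pod per node
      containers:
      - name: nova-api
        image: example.registry/nova-api:newton  # placeholder image
```

In Kubernetes 1.4 and 1.5 these affinity rules were expressed through an alpha annotation rather than the fields shown here; the field form above is the 1.6+ shape.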
So let's say I have three control-plane machines; I can tell Kubernetes that these pods should land on just those three machines. There is also pod affinity. When you look at nova-compute, you usually want it to run alongside libvirt, because to launch VMs you simply need libvirt. With pod affinity, you can tell Kubernetes: as soon as a nova-compute pod lands on a node, please also deploy the libvirt container on that node. The same applies to the Neutron agents: if the neutron-openvswitch-agent is running on a host, make sure ovs-vswitchd and ovsdb-server are deployed on that host as well (there is a sketch of this rule below). The downside of using Deployments here shows up when you expand your cluster and want more compute nodes: increasing the number of nova-compute replicas is not enough, because you also need to increase the number of libvirt replicas, so that both containers keep the same replica count across the cluster.

The last thing is pod layout. Initially, because there was no pod affinity and anti-affinity to keep the neutron-openvswitch-agent together with ovs-vswitchd and ovsdb-server, we combined those three containers within a single pod. But this has disadvantages too. For example, when you update the neutron-openvswitch-agent, you have to restart all containers in the pod, which meant you also restarted ovs-vswitchd and ovsdb-server, and that meant every service running on that server simply lost its network. That's why most deployments nowadays have moved to a model where each pod holds just a single container: when you upgrade the application in that container, only that application is affected, and there is no disruption to other services during the upgrade. This was not the rule a year ago either.

Going back to drains: I mentioned that DaemonSets do not support drain. As I said at the beginning, we wanted to make things easier and simpler, and Kubernetes has a very nice feature called drain. Whenever you want to put a node into maintenance, you can ask Kubernetes to evict all the containers running on that node, hand them back to the scheduler, and place them on different nodes. But in the OpenStack world, just killing the nova-compute container is not enough, because you usually have virtual machines running on that machine. If you want those virtual machines to be unaffected by an upgrade or machine maintenance, you need to live-migrate them. So we came up with a solution we call nova-kubernetes-drain, which listens to the Kubernetes event queue, and whenever there is a drain event it also starts live migration of virtual machines. You simply ask Kubernetes to empty the node of all containers; nova-kubernetes-drain notices that and live-migrates all the VMs off that machine as well, so you can safely go ahead with the maintenance with no impact on the services running in your cluster. You can also handle capacity: say you are running out of capacity on a particular node because one of the applications or VMs consumed too much. With kubectl cordon, you can ask Kubernetes not to schedule any more containers on it.
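Here is a hedged sketch of the co-scheduling rule mentioned above: wherever a nova-compute pod lands, schedule the libvirt pod on the same node. Labels and images are placeholder assumptions:

```yaml
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: libvirt
spec:
  replicas: 3            # must be scaled in lockstep with nova-compute
  template:
    metadata:
      labels:
        app: libvirt
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: nova-compute       # follow nova-compute pods
            topologyKey: kubernetes.io/hostname
      containers:
      - name: libvirt
        image: example.registry/libvirt:latest   # placeholder image
        securityContext:
          privileged: true   # libvirt needs host-level access to run VMs
```

The `replicas: 3` comment is exactly the lockstep-scaling downside described in the talk: adding a compute node means bumping both the nova-compute and the libvirt replica counts.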
nova-kubernetes-drain intercepts that as well and disables future scheduling on that compute node in Nova too. And it works the other way around: whenever you re-enable scheduling in Kubernetes, the Nova compute node is enabled for scheduling as well. All this magic is implemented as a container lifecycle hook in Kubernetes: whenever there is an action to kill the container, nova-kubernetes-drain intercepts it and executes the live migration commands. This feature works only with the Deployment controller; it does not work with DaemonSets, because DaemonSets were created to hold infrastructure services like Fluentd, which should live as long as the machine is alive. That's why it's not always a good idea to put an application inside a DaemonSet controller.

As I said, we started our initiative on Kubernetes 1.2. A funny thing: we wanted to use ConfigMaps for configuration files, but it turned out that ConfigMaps, when mounted as files, could be accessed only by the root user, and we wanted to run our containers as a non-root user, so we had to upgrade to a development version of Kubernetes 1.3. Until Kubernetes 1.4 there was also a problem with delivering the resolv.conf file to containers running with host networking. When you mounted a ConfigMap into a directory, you shadowed all the files already present there, so there was no way to shadow just one file, in our case resolv.conf, without shadowing everything else in /etc. Kubernetes 1.4 introduced the subPath mechanism, which lets you select a single file from a ConfigMap and explicitly mount it as a single file on the file system. Before that, we just used kubernetes-entrypoint, which I will describe in a moment, to copy the file from the ConfigMap to the appropriate place in the container's file system. Kubernetes 1.4 brought another fix for files mounted from ConfigMaps: until 1.4, you could not open such a file for writing. This manifested when we were deploying a RabbitMQ cluster: there is the RabbitMQ Erlang cookie file, which contains a shared secret for the cluster, and RabbitMQ by default assumes it can open it for writing. As ConfigMap files were not writable, that resulted in an error; happily, it was fixed in Kubernetes 1.4.

Going back to kubernetes-entrypoint: even today, Kubernetes itself has no native dependency management mechanism. So we came up with a simple, clever piece of software, initially deployed inside the container as a wrapper around the actual application, which talks to the Kubernetes API and checks whether all dependencies are resolved. It is very simple, because every container can access the Kubernetes API, and with its help you can check the state of other containers. Say Nova API requires the MySQL database to be running: you can check the MySQL Service in Kubernetes, verify that it has endpoints, and that those endpoints are in an OK state. The same applies to other services: if nova-compute requires nova-conductor to be running, you check whether nova-conductor pods are deployed on the cluster, and if they are deployed and their state is OK, you treat that dependency as resolved. A sketch of this follows below.
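A hedged sketch combining two mechanisms from this section: an init container running kubernetes-entrypoint to gate startup on a Service dependency, and a single config file projected from a ConfigMap with subPath. The image tag, the exact environment variables beyond DEPENDENCY_SERVICE, and the file paths are assumptions; the init-container field shown here is the 1.6+ form (earlier releases used an annotation):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nova-api
spec:
  initContainers:
  - name: deps
    image: quay.io/stackanetes/kubernetes-entrypoint:latest   # assumed tag
    env:
    - name: NAMESPACE
      valueFrom:
        fieldRef:
          fieldPath: metadata.namespace
    - name: DEPENDENCY_SERVICE
      value: "mysql,keystone-api"   # block until these Services have endpoints
  containers:
  - name: nova-api
    image: example.registry/nova-api:newton   # placeholder image
    volumeMounts:
    - name: nova-etc
      mountPath: /etc/nova/nova.conf
      subPath: nova.conf      # shadow only this one file, not all of /etc/nova
  volumes:
  - name: nova-etc
    configMap:
      name: nova-etc
```

The init container exits once the dependencies resolve, and only then does Kubernetes start the application container.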
The initial approach was to run this logic inside the container with the actual application. That has disadvantages, because you cannot simply take any image from the internet and put it in your cluster; you need a custom build process. But since Kubernetes 1.3 there are so-called init containers: you can put kubernetes-entrypoint into an init container and hold the execution of the application container until all dependencies are met. When they are resolved, the init container exits and the actual application starts. kubernetes-entrypoint was developed as part of the Stackanetes project, but it seems to be no longer maintained, so if anyone is willing to maintain that piece and expand their Go skills, everyone is welcome to join. This has become really important, because all the projects except FuelCCP use that piece of orchestration.

The other issue we needed to solve was how to safely deploy the MySQL database on top of Kubernetes, because you want some kind of HA, so your OpenStack deployment keeps working. Kolla-Kubernetes solved that, a year ago at least, by running just one replica of the database, taking advantage of a StatefulSet. So they had a single database instance, with storage on Ceph via persistent volume claims, and when the database went down because the host running its container went down, Kubernetes took care of respawning the database elsewhere in the cluster. But that is not a full HA solution: it is active-passive, not active-active, and the downtime was about a minute, the time Kubernetes needs to respawn the pod after a failure. In Stackanetes we wanted Galera from the beginning, so we came up with some clever software which, again, talks to the Kubernetes API and checks whether there are other MySQL Galera members in the cluster; if yes, the newly started container joins them. A fresh database deployment, when we deploy OpenStack for the first time, is detected as well: a seed node is started, which does all the database initialization and connects all the other members. Say you have three replicas plus the seed: once all three members have joined, the seed exits, as it is just a plain Kubernetes Job. With that automation we were able to avoid PetSets, or StatefulSets as they are called nowadays. That said, StatefulSets offer predictable host names in Kubernetes, which is sometimes useful because you don't get the random suffixes, and until Kubernetes 1.4 they were the only controller supporting persistent volume claims.

Now, CronJobs. As you probably noticed, most of the OpenStack on Kubernetes deployments were not using Fernet tokens in Keystone, because there was a problem with rotating the Fernet keys: there was no native Kubernetes mechanism that could rotate them on a predictable schedule. That changed in Kubernetes 1.6, as we finally have Kubernetes CronJobs. We can simply ask Kubernetes to rotate those keys every five minutes, ten minutes, or an hour, whatever cadence the operator wants.
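A hedged sketch of that CronJob approach for Keystone Fernet key rotation. In the 1.6/1.7 timeframe CronJobs lived under the batch/v2alpha1 API group (and had to be enabled explicitly); the image is a placeholder, while `keystone-manage fernet_rotate` is the standard Keystone rotation command:

```yaml
apiVersion: batch/v2alpha1     # batch/v1beta1 in later Kubernetes releases
kind: CronJob
metadata:
  name: keystone-fernet-rotate
spec:
  schedule: "0 * * * *"        # hourly; whatever cadence the operator wants
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: rotate
            image: example.registry/keystone:newton   # placeholder image
            command:
            - keystone-manage
            - fernet_rotate
            - --keystone-user=keystone
            - --keystone-group=keystone
```

A pod affinity rule, as described below, would additionally pin this job to the nodes where the Keystone containers and their key repository actually live.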
Of course, this was possible before as well: you could launch a container with a simple Bash script running an infinite loop, rotating the keys and then sleeping again. But it required a lot of tricks: you needed to access the file system locally, so you had to deploy those pods exactly where the Keystone containers were running. With Kubernetes CronJobs, and with the help of pod affinity, you can always make sure the job runs exactly where the Keystone container is running, and you can safely rotate those keys.

Another use case for this feature is removing dead agents, since Kubernetes containers have those random host names. Say you have the Nova API service: every time a Nova API container starts, its name is nova-api- plus some random string. Now imagine one node in your cluster is malfunctioning, or its network is: the Nova API container might be respawned every 30, 45, 60 seconds, and every time it starts, it registers in Nova as a new agent. You end up with hundreds or thousands of Nova API service records, most of them dead, and usually you would like to remove those dead agents. Again, you could run a container with an infinite loop checking for that situation and cleaning up, or you can use a Kubernetes CronJob which does that check every five minutes, every hour, or every day and removes the dead agents. This is a really nice feature that helps keep OpenStack deployments clean.

There was also a bunch of improvements in OpenStack itself. Before Mitaka, live migration in OpenStack required that the DNS in the underlay could resolve all host names in the cluster. When you deploy your cluster the legacy way, this is usually true, but in a more dynamic environment like Kubernetes it is not. So it was a very helpful feature that you can simply set, in the configuration file, the IP address that should be used to live-migrate virtual machines to or from a specific compute node. And we can easily extract the IP address of the container when we start the nova-compute service, because there is an environment variable telling us the pod IP address; a sketch of this follows below.

It was really sad when we discovered that cinder-volume, in a few cases, had no HA. Again, because container host names are dynamic: say you have a cinder-volume service running and you create a volume; that volume gets bound to cinder-volume- plus some random string. As this is a dynamic environment, that container can get killed and respawned with a different host name. For Cinder, that meant the volume was no longer manageable, because the cinder-volume instance that owned this particular volume was dead and no longer running. Even though everything was fine in the Ceph cluster and we had plenty of cinder-volume instances, we could not manage the volume. This was fixed in Ocata, and it definitely helps the Cinder case on top of Kubernetes, but not many projects were aware of it: the only ones that were aware were FuelCCP and Stackanetes; the others never fixed the issue on older releases.
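A hedged sketch of that pod-IP trick via the Kubernetes downward API. The `live_migration_inbound_addr` option is the Mitaka-era Nova libvirt setting the talk alludes to; how the container's entrypoint templates the variable into nova.conf is an assumption, and the image is a placeholder:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nova-compute
spec:
  hostNetwork: true
  containers:
  - name: nova-compute
    image: example.registry/nova-compute:newton   # placeholder image
    env:
    - name: POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
      # the entrypoint is assumed to render this into nova.conf as
      # [libvirt] live_migration_inbound_addr = $POD_IP
```

With that in place, live migration targets an address rather than a hostname, so the underlay DNS no longer has to know about every dynamically named container.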
There was also a lot of work on improving the Neutron L3 agent split-brain behavior. Again, in this dynamic environment where containers are redeployed as soon as the Kubernetes controller detects a problem, you can end up in a situation where some agents are still running and think they are the master from the VIP perspective, but a single call to the Neutron API is enough to figure out that this is not necessarily true. That is exactly what was done in Neutron, so we have much better split-brain prevention in Ocata now. And there was a lot of very exciting work on clustering RabbitMQ on top of Kubernetes: with the help of the autocluster plugin, and a lot of work Mirantis did hardening that plugin with the etcd backend, we can say we have a bulletproof clustering solution for RabbitMQ on top of Kubernetes. You can simply scale up, scale down, and the cluster still works, and it survives the various bad scenarios that can happen in your cluster as well.

The first release that could be considered production-ready was around Kubernetes 1.5. But as everyone kept moving forward and putting more and more OpenStack components on top of Kubernetes, at some point we realized it is not that simple anymore: an OpenStack deployment on top of Kubernetes became really complex. There are a lot of manifests, and those manifests are really long and really complex. The key problem is that there was no mechanism to prevent code duplication: the same sections of code were present in the API manifest, the Keystone manifest, the volume manifest, the Neutron agent manifest. Whenever you discovered a problem or a bug and wanted to fix it, you needed to fix it in twelve files, in twelve places, which is crazy. FuelCCP was in the luckiest position, because they have their own tooling and do a lot of templating from Python code. Other projects like OpenStack-Helm were not so lucky, because they simply relied on duplicating the same code across multiple files.

There are several efforts aiming to fix this. The first, happening in Kubernetes itself, is pod presets: you can create an object in Kubernetes called a PodPreset and have it injected into matching pods; a sketch follows below. At the moment pod presets support only very simple statements, but of course that will grow with time, and I hope that in the future you will be able to template most of the container statements in pod presets. There are also separate efforts, like generating Kubernetes manifests with Jsonnet; that project is driven by Heptio with their ksonnet libraries. We had some very good experience with Jsonnet in Stackanetes, and we really believe that project will simplify a lot. And of course, Red Hat is also trying to solve the issue with their OpenCompose project.
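A hedged sketch of a PodPreset as it existed in the 1.6/1.7 timeframe (alpha, under settings.k8s.io/v1alpha1): shared environment and volume boilerplate is defined once and injected into every pod matching the selector, instead of being copy-pasted across a dozen manifests. The label, endpoint, and ConfigMap names are placeholder assumptions:

```yaml
apiVersion: settings.k8s.io/v1alpha1
kind: PodPreset
metadata:
  name: openstack-common
spec:
  selector:
    matchLabels:
      openstack: "true"     # every OpenStack pod opts in via this label
  env:
  - name: OS_AUTH_URL
    value: http://keystone-api:5000/v3   # placeholder endpoint
  volumeMounts:
  - name: ca-bundle
    mountPath: /etc/ssl/certs
  volumes:
  - name: ca-bundle
    configMap:
      name: ca-bundle
```

Fixing a bug in the shared section then means editing one PodPreset rather than twelve service manifests.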
Looking at what we think is going to happen in the next couple of weeks or months: the Oslo community has figured out that holding all the configuration in a configuration file is no longer the desired situation, and that it is a good idea to hold part of the configuration in an etcd backend. We think OpenStack on top of Kubernetes should go even further: with so many projects, we should add a Kubernetes ConfigMap backend to oslo.config, so Oslo can read directly from Kubernetes ConfigMap objects. That would simplify deployments a lot, because you could simply forget about all the templating and mounting of those ConfigMaps inside the containers.

The other thing, a bit more technical, is the shared PID namespace. Docker recently introduced the ability for multiple containers to share the same PID namespace, and I hope support for that lands in Kubernetes shortly. This is extremely helpful when you have a malfunctioning service creating zombie processes, because in a regular container there is no init process with PID 1 that can take control of those zombies. The SAP folks found a workaround: in every container, they run the dumb-init software, which takes care of that. But with a shared PID namespace, we can have a cleaner solution that still follows the best practice of one application per container: dumb-init could be deployed as a sidecar container sharing the PID namespace with the other containers, reaping zombies on demand.

And while we were fighting so hard to ensure that OpenStack itself is upgradeable, and there is a tool to upgrade OpenStack, we forgot about Kubernetes upgrades. It is very unfortunate that Kubernetes upgrades are not so easy right now. There is no tool which orchestrates them for users, and, even more scary, the upgrade from Kubernetes 1.6 to 1.7 requires downtime of all services in Kubernetes: all containers need to be shut down and started again. I hope this is a one-time event and will not happen in the future, but it shows that not everything is perfect in our world yet. Also, resource isolation is coming to Kubernetes, so we will finally have CPU sets. For control-plane applications where we want to ensure a certain level of QoS, like the OpenStack API services, this is a very desired feature: we would be able to have dedicated cores for the API services, so there would simply be no noisy-neighbor problems for them. As we want our cloud to run with the biggest possible uptime and be as responsive as possible to our users, this is the way to go in the future.

And that is all. Now it is time for questions, if there are any.

Q: I heard you say you solved the MySQL auto-clustering problem. Did you use etcd, ConfigMaps, Secrets? How did you do that? Can you elaborate a little more?

A: The solution is quite simple. Before the actual MySQL instance starts, there is a script which connects to the Kubernetes API, checks the MySQL Service in Kubernetes, and checks its endpoints. If there are endpoints, it means there are other members of this cluster, and the script sets those IP addresses, extracted from the Service, as the other members of the cluster; then MySQL boots up. You can check how it is done in Stackanetes and in OpenStack-Helm; those are the two projects taking advantage of this approach. OpenStack-Helm even rewrote it as a Bash script, so it is very simple.
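A hedged sketch of that peer-discovery step: before mysqld starts, an init container queries the Kubernetes API for the endpoints of the mysql Service and writes out any peer IPs for the Galera cluster address. The in-cluster token path and API URL are standard conventions; the images, the jq dependency, and how the mariadb entrypoint consumes /peers/ips are assumptions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mariadb-galera
spec:
  initContainers:
  - name: peer-finder
    image: example.registry/peer-finder:latest   # placeholder; needs curl + jq
    command:
    - sh
    - -c
    - |
      TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
      # List current endpoints of the mysql Service; an empty list means
      # this is a fresh deployment and we become (or wait for) the seed.
      curl -s --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
        -H "Authorization: Bearer $TOKEN" \
        https://kubernetes.default/api/v1/namespaces/default/endpoints/mysql \
        | jq -r '.subsets[].addresses[].ip' > /peers/ips || true
    volumeMounts:
    - name: peers
      mountPath: /peers
  containers:
  - name: mariadb
    image: example.registry/mariadb-galera:latest  # assumed to read /peers/ips
    volumeMounts:
    - name: peers
      mountPath: /peers
  volumes:
  - name: peers
    emptyDir: {}
```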
Q: As a follow-up to that, for the database clustering, are you saying that the StatefulSet is not being used? And do you see it being used at some point later?

A: Stackanetes is not using a StatefulSet, but OpenStack-Helm does use one, because of the predictable host names: instead of a random suffix you get, say, mariadb-0, mariadb-1. The StatefulSet is, again, a bit problematic, because the StatefulSet controller does not offer upgrades. But you don't upgrade your database that frequently, so you can live with that.

Q: Just another one: can you clarify what you did for RabbitMQ clustering? You mentioned something about an etcd backend.

A: It was not me, it was Mirantis who did that. Basically, they took the autocluster plugin and improved it a lot. There was already an etcd backend in the plugin, but it was very opportunistic: in a lot of cases the RabbitMQ cluster simply ran out of sync, and there were a lot of race conditions on cluster startup and split-brain situations. Mirantis really hardened that part, and as I said, it is bulletproof and proven in multiple production clusters at the moment.

Q: And what is that called?

A: The autocluster plugin. You should go to the FuelCCP RabbitMQ repo; it is a reference design for how they take advantage of it in their platform.

Q: Thank you. Great session, Tomasz. I work on the Kolla and Kolla-Kubernetes projects. You mentioned oslo.config; I'm wondering whether you did any integration with oslo.log and the cloud-native Fluentd logging tooling. And also, how did you do monitoring? Did you use any cloud-native Prometheus code at all, or non-CNCF OpenStack monitoring?

A: We work at Intel, and we use Snap for everything monitoring-related; we even showed a Snap demo at multiple summits. So that is how we monitor, though I personally use Prometheus.

Q: And did oslo.log work out of the box for you?

A: oslo.log, yes, but in most of the deployments you basically redirect the logs to standard output.

Q: To the ELK stack, right?

A: Yes, and then you use, say, Fluentd to collect those logs from the containers or from the systems.

Q: Cool, thanks a lot.

OK, thank you very much, everyone.