We have to start. Hi everyone, good morning, and welcome to this presentation on Cinder reliability and scalability. I'm Michał, and this is Gorka; we're both part of the Cinder core team. Here is the agenda: we'll talk about how Cinder looks, explain how to deploy Cinder in HA, how to keep Cinder up and running, how to take it down cleanly, and then cover replication and some tips and tricks.

This is an overview of the Cinder architecture. It's a set of services that together make up Cinder, and it looks very similar to how Nova looked in Folsom, because Cinder was split out of Nova in Folsom. Besides cinder-api there is cinder-scheduler, and these services normally run on the controllers. When you use local drivers like LVM or NFS, which need access to storage local to the physical host, you run cinder-volume on that same physical host. Then there is the cinder-backup service, which is optional. It handles backups, and it's a trickier service, because before Mitaka it was tightly coupled to cinder-volume and had to run on the same physical host. Since Mitaka it communicates over RPC, so it is decoupled and you can scale cinder-backup independently of cinder-volume. I also want to point out that some people, when they use drivers like Ceph or other shared storage, run cinder-volume on the controller nodes, which is fine because that storage is reached over the network rather than locally. That alone, though, doesn't make the services highly available. So if that wasn't complicated enough, I'll add more complexity and explain how to make these services HA.

cinder-api is fairly simple: it's stateless and HTTP is stateless, so we can just run multiple instances behind a load balancer, and a combination of Keepalived and HAProxy will make sure that requests are passed only to the living cinder-api instances. There was one problem in the pre-Newton cinder-api: we had race conditions around statuses. This is what it looked like before Newton: in various places we had checks like this one, and there is an obvious race condition between getting the volume from the DB, checking its status, and changing the status. When running multiple cinder-api instances, it could happen that two of them started some kind of flow in parallel, which, for example in this retype flow, isn't really safe. That changed in Newton: most of those places now use conditional updates. A conditional update is a single SQL UPDATE statement with multiple WHERE clauses, which moves all that logic into the database, so it's an atomic operation and should be safe. cinder-scheduler is even easier, because you don't really need a load balancer: RabbitMQ, or any other broker, will serve as the load balancer.
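Before going further into the scheduler, here is a minimal sketch of the conditional-update idea mentioned above. It is not the actual Cinder code (Cinder has its own conditional-update helper); the Volume model and statuses are simplified for illustration.

```python
# Hedged sketch (not the actual Cinder code): contrasts the racy
# check-then-update pattern with an atomic conditional update done
# entirely in the database.  Model and session setup are illustrative.
from sqlalchemy import Column, String, create_engine
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Volume(Base):
    __tablename__ = 'volumes'
    id = Column(String(36), primary_key=True)
    status = Column(String(255))

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)

def retype_racy(session, volume_id):
    # Pre-Newton style: read, check, then write.  Two API workers can both
    # pass the check before either one writes, so both start the flow.
    volume = session.get(Volume, volume_id)
    if volume.status != 'available':
        raise RuntimeError('volume is busy')
    volume.status = 'retyping'
    session.commit()

def retype_conditional(session, volume_id):
    # Newton style: a single UPDATE ... WHERE id=? AND status='available'.
    # The database applies it atomically; the row count tells us who won.
    rows = (session.query(Volume)
            .filter(Volume.id == volume_id, Volume.status == 'available')
            .update({'status': 'retyping'}, synchronize_session=False))
    session.commit()
    if rows == 0:
        raise RuntimeError('volume is busy')

with Session(engine) as session:
    session.add(Volume(id='vol-1', status='available'))
    session.commit()
    retype_conditional(session, 'vol-1')
```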
So we were just running multiple cinder-schedulers on different nodes, and you get redundancy: requests will be distributed, in the case of RabbitMQ, in a round-robin fashion. There is one problem with that: each scheduler keeps its own information about the status of the cinder-volume services. So it may happen that, if you are running out of capacity, two schedulers will independently schedule a volume onto a cinder-volume backend that only has room for one more volume. Only one request will win the race and get the volume created on that backend; the second one won't have any place left, so it will end up in error. That's the same problem the Nova scheduler has. Now I will hand over to Gorka, who will talk about cinder-volume.

So cinder-volume is different from the API and the scheduler, because it doesn't currently support active-active configurations. There are multiple reasons for this. First, we require mutual exclusion in some of the Cinder core operations as well as in some of the drivers, and currently we do that with file locks, so we need a distributed locking mechanism here. We also have problems with the cleanup, which is tightly coupled to our current architecture, so we need a new, more powerful mechanism that will prevent multiple nodes in the same cluster from interfering with the cleanup of other services in that same cluster, and we also need a mechanism for the API to request cleanups from nodes that are down and not accessible, so other nodes in that same cluster can do the cleanup. We also need job distribution that is cluster-aware, and we need to change operations that are tightly coupled to our architecture so they can support this cluster concept. One of those is the replication operation, which expects only one service to be providing the replication.

People have been deploying Cinder active-active for a long time. Some of them just changed the host configuration option and set it to the same value in all the services in the cluster. This was not good, because whenever you added a new host or node to the cluster, it would mess up all your ongoing operations, and the same thing would happen if you restarted one of the nodes that were already there: it would wreak havoc on everything you had running in the cluster. There have been cases where active-active has been done successfully, but they basically removed the cleanup mechanism from Cinder, and they either work with custom workarounds or they use a shared directory for the locks. We expect to have a tech preview of the active-active configuration in Ocata, with some drivers supporting it, although some of these drivers may not support all the features; for example, they would require additional changes for replication, so they may not get that in there, but we expect to have some.
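On the mutual-exclusion point above: the locking cinder-volume relies on today is node-local file locking, which is exactly why it cannot be shared across a cluster. A minimal sketch of that pattern, assuming oslo.concurrency is available; the lock name and lock path are illustrative.

```python
# Hedged sketch of node-local file locking, the mechanism that makes
# active-active unsafe today: the lock only exists on this node's
# filesystem, so a cinder-volume on another node never sees it.
from oslo_concurrency import lockutils

# Illustrative lock path; in a real deployment this comes from configuration.
lockutils.set_defaults(lock_path='/var/lock/cinder')

@lockutils.synchronized('vol-12345-delete', external=True)
def delete_volume():
    # Critical section: safe against other processes on *this* host only.
    print('deleting volume')

delete_volume()
```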
Once we implement all the features we are planning for active-active, we will be able to do advanced configurations like the one shown there. There's not much point explaining it in detail, it's not that relevant, but basically we will be able to do any sort of active-active configuration.

The way we have solved the different issues I mentioned is as follows. For the distributed locking we are going to use Tooz, which is an abstraction layer that gives us distributed locks with whatever backend the deployer wants; they may be using ZooKeeper, etcd, or Consul, we don't care. That's why we use Tooz: to abstract from the concrete implementation. It will be used in some operations in the Cinder core and in drivers, depending on how their backends work. For job distribution we will be using the same mechanism the scheduler is using: we move the logic of distributing work to the message broker, and we will create a couple of new message queues and an exchange. With this first implementation, disabling services will be at the cluster level, so you will not be able to disable individual services within the cluster; you either disable the whole cluster or you don't. In the future we will allow individual services within the cluster to be disabled. For the cleanup mechanism we will be using a new database table that keeps track of the cleanable operations: not only which operation is going on, but also who is doing the operation at each point in time. This gives us a fine-grained cleanup mechanism, so we can not just say "clean up this cluster" or "clean up this service", but also "clean up only volumes" or "clean up these specific volumes", and things like that.

So let's see how the workers table will work. We receive a request at the API level, and as soon as we create the volume entry in the database we also create a similar entry in the workers table that reflects the service that is working on it. The API will not set itself as the worker, because it's not actually doing the work. The entry also reflects the status: once the request is sent to the scheduler, upon reception the scheduler updates the entry to show that it has received the work and is actually working on it, and the same thing happens when cinder-volume receives it. If the service dies at this point, we know that we would have to do that cleanup. The entry also gets updated when a new cleanable status is set on the resource; for example, when you are creating a volume and it changes to downloading, that's a new cleanable status. We have a couple of cleanable statuses, like deleting, creating, downloading; not all statuses are cleanable at this moment. So whenever the resource changes to one of those, you also update the workers table to reflect it, and once you complete the operation and set the resource to a stable status, you delete the workers entry from the database completely. That is basically how we are going to track it.

The cinder-backup service, since we decoupled it in Mitaka, already supports active-active, but it has a couple of limitations. For example, you only have one cluster for all your backups, so you can only have one backend if you are deploying it in decoupled mode. You also don't have a way to clean up the resources a node has been working on upon its death; there is no automatic way of doing it, you have to go resource by resource, doing it manually. But it works active-active, which is good.
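Going back to the distributed locking for a moment, here is a minimal sketch of what a distributed lock looks like with Tooz. The backend URL, member id, and lock name are illustrative; any Tooz driver (ZooKeeper, etcd, Consul, ...) could be plugged in instead, and this is not the actual Cinder code.

```python
# Hedged sketch of a Tooz distributed lock, as an illustration of the
# abstraction layer mentioned above (not the actual Cinder code).
import uuid

from tooz import coordination

# Illustrative backend URL; the deployer chooses the driver (zookeeper://,
# etcd3+http://, consul://, ...).  The member id just has to be unique.
coordinator = coordination.get_coordinator(
    'etcd3+http://127.0.0.1:2379', uuid.uuid4().hex.encode())
coordinator.start(start_heart=True)

# The same lock name on any node in the cluster maps to the same lock,
# which is what plain file locks cannot give us.
lock = coordinator.get_lock(b'cinder-vol-12345-delete')
with lock:
    print('critical section: only one service in the cluster runs this')

coordinator.stop()
```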
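And to picture the workers table walkthrough above, here is an illustrative sketch of such a tracking table and how an entry moves through the flow. The column names and statuses are assumptions for illustration, not Cinder's exact schema.

```python
# Hedged sketch of a cleanup-tracking ("workers") table, illustrating the
# mechanism described above; columns and statuses are illustrative only.
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Worker(Base):
    __tablename__ = 'workers'
    id = Column(Integer, primary_key=True)
    resource_type = Column(String(40))     # e.g. 'Volume'
    resource_id = Column(String(36))
    status = Column(String(255))           # cleanable status: creating, deleting...
    service_id = Column(Integer, nullable=True)  # who is working on it right now

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)

with Session(engine) as s:
    # The API creates the entry but sets no worker, since it does no work itself.
    s.add(Worker(resource_type='Volume', resource_id='vol-1',
                 status='creating', service_id=None))
    s.commit()

    # The scheduler, then cinder-volume, claim the entry upon reception.
    w = s.query(Worker).filter_by(resource_id='vol-1').one()
    w.service_id = 42          # illustrative id of the cinder-volume service
    w.status = 'downloading'   # updated again when a new cleanable status is set
    s.commit()

    # On completion the resource reaches a stable status and the row is removed.
    s.delete(w)
    s.commit()
```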
Now Michał is going to go over rolling upgrades, which is one of the features that allow us to be up and running all the time. One quick note first, because maybe not everyone knows what the cleanup mechanism is: it's the mechanism that, on cinder-volume start, looks for resources left in transient statuses like creating or downloading and moves them to error, and that's problematic when several cinder-volume services share the same backend, which is why active-active needs the new cleanup we just described. OK, so we're working on that for cinder-volume.

Rolling upgrades are all about compatibility: during a rolling upgrade we will be running services from two different releases side by side, so we have to maintain compatibility between them, and there are three pieces to that. The first is database schema compatibility, so that the old code still works with the new DB schema. Then we need compatibility of the RPC communication between the services, and finally compatibility of the versioned objects that are sent over RPC, for example between cinder-api and cinder-volume or cinder-scheduler. There are a few rules here. You cannot skip a release, so you can only upgrade from release X to X+1; you cannot go, for example, from Mitaka straight to Ocata in a live fashion. And of course live upgrades only make sense with HA, because we will be taking services down one at a time, so you should have HA in place before attempting a rolling upgrade. Cinder also has version detection built into the services: each service reports the versions it supports, and everything pins itself to the lowest version it can see, so mixed-version services can still talk to each other.

So here is the upgrade procedure. We start with an HA Cinder deployment running release X, and we want to move to X+1. The first step is to upgrade the database schema; as I said, the schema migrations are meant to be compatible, so the old services keep working with the new schema. Then we can start upgrading the services. We begin with cinder-api: take one instance out of the load balancer, upgrade it, and start it again; it will detect that it is talking to older services and will pin all its communication to stay compatible with that version. You repeat the same upgrade with the remaining cinder-api instances, and then move on to cinder-scheduler, which is a bit easier because there is no load balancer in front of it: you just stop one scheduler, upgrade it, and start it again, and while it is down RabbitMQ simply delivers the requests to the remaining schedulers, so nothing is lost. With cinder-volume there will be a period when no cinder-volume is running for a given backend, so some requests will just sit in the queue. If we keep that window as short as possible, shorter than the service-down timeout, that is fine, because the schedulers won't realize that the cinder-volume is down, and the messages addressed to it will simply stay queued in RabbitMQ. So, when cinder-volume is up again, all the messages will flow to it and they will start to be processed, so this shouldn't introduce any interruption. Then there's the backup service, which is also a little different, because backups are kind of long-running operations; I've heard about backups of very big volumes taking several hours. So before upgrading it you probably want to disable it first, so it won't accept any new requests, then wait a reasonable amount of time until it finishes all the requests it already has.
You can check the logs, for example, to see if it's still processing. Then proceed as with cinder-scheduler: turn it off, upgrade it, and you should also enable it again. Disable and enable here are the service-disable and service-enable commands of the cinder client. And you repeat that with all the cinder-backup services. There are two more steps to go. The first one is that Cinder caches the detected versions of the services. To make sure that the cache is invalidated and the version detection is run again, you need to restart the cinder-api service and send SIGHUP signals to (or restart, if that's more convenient for you) the rest of the services. Then the version detection mechanism runs again, all the services realize that they are now running X+1, and that's the completed upgrade. There's one more step that will probably be required in the Newton-to-Ocata upgrade and the following ones: the same mechanism Nova has with online data migrations. We are limiting data migrations in the schema migrations to make sure those are non-disruptive and can be applied online. So there's a requirement that you execute all the online data migrations before upgrading to the next version; in that diagram it's X+2, and in this Newton-to-Ocata case it would be before upgrading to Pike.

And that's the compatibility table. It was impossible to upgrade from Kilo to Liberty live, because we only started working on this in Kilo. Liberty to Mitaka was manually tested, so we call it experimental. Mitaka to Newton was tested in CI, but with a non-voting job; I've called it experimental-plus, but the CI was passing, it was okay. And Newton to Ocata is supported and tested, because every commit is now checked for upgradeability, that it doesn't introduce any incompatibilities and doesn't break the upgrade workflow, and this is a voting job, so no commit can break it now, hopefully. I'll hand over to Gorka, who will talk about replication. Thank you.

So I'm going to highlight the replication feature in Cinder. Version 2.1 is only meant to take care of one use case, a very specific one: the smoking-hole case, where your whole backend is somehow gone. It currently supports multiple replication sites, or disaster recovery sites, however you want to look at it. And we have two different types of drivers; we didn't want to limit or force backends into one way or another. So you can have drivers that do per-volume replication, where you must specify "I want this volume replicated", drivers that do per-backend or per-pool replication, and drivers that can do both. The failover is not automatic: the sysadmin has to go in there and say "please fail over", either to any secondary backend or to one specific one. After failover, as we would all assume, the replicated resources will be available, but non-replicated ones will not. The way it works is that drivers report back to the schedulers whether they are capable of doing replication or not, and the scheduler uses this information to match it with the request from the client, the API request, and schedule the volume creations according to whether you want them to be replicated or not. Once we do the failover, what the driver does is change the required information in the database to point to the new location. Depending on the driver, it may need to make changes in the backend to promote the secondary, and the service gets disabled by default. You can re-enable it, but by default it gets disabled, so no new volumes are created there.
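Going back to the upgrade steps for a moment: the disable/enable mentioned above maps to the cinder client's service-disable and service-enable. A minimal sketch with python-cinderclient follows; host names, binary, and credentials are illustrative.

```python
# Hedged sketch: disabling a cinder-backup service before upgrading it and
# re-enabling it afterwards, as in the upgrade procedure described above.
# Host/binary names and credentials are illustrative.
from cinderclient import client as cinder_client
from keystoneauth1 import session
from keystoneauth1.identity import v3

auth = v3.Password(auth_url='http://controller:5000/v3',
                   username='admin', password='secret', project_name='admin',
                   user_domain_id='default', project_domain_id='default')
cinder = cinder_client.Client('3', session=session.Session(auth=auth))

# Stop new backup requests from landing on this node...
cinder.services.disable(host='backup-node-1', binary='cinder-backup')

# ...wait for in-flight backups to finish and upgrade the node, then:
cinder.services.enable(host='backup-node-1', binary='cinder-backup')
```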
While version 2.1 is fully functional, there are some limitations, and we will be working on fixing some of them in this release. For example, the rest of OpenStack is not aware when you do a failover, so you will need to manually re-attach your volumes; that's a big one. You cannot force-promote your secondaries; they always stay in the failed-over state. Some of the drivers don't support failback, some of them do; we will try to make them all support it in this release. We also have other inconsistencies among drivers that we will try to fix. Another big one is that the freeze mechanism doesn't actually work. The freeze mechanism is supposed to prevent any changes in your backend: you should be able to attach volumes, use the contents, and detach, but you should not be able to create, delete, or migrate. That was the idea. The thing is, it doesn't work: it only prevents the operations that go through the scheduler from happening.

Now I'm going over some tips and tricks to make your cloud either more reliable or to reduce the downtime. Until one or two releases ago we had a big bottleneck that was quite visible. In Cinder, for example, when you were under heavy load, your attaches and detaches could go from a couple of seconds to over two minutes to complete. This was a big thing. The problem was a contention issue between the number of threads and the number of database connections we had. This happened in almost all the OpenStack services, because we had 1,000 green threads and only up to 15 database connections. If the service required a good number of database connections, requests would need to be queued and wait for others to complete. The solution was to reduce the number of threads to a smaller amount and to increase the maximum number of connections per service. In Nova, for example, it's only important for the conductor and the API, I believe, because those are the main ones that hit the database, but in Cinder all our services go to the database, so it's important for all of them. These are the configuration parameters that can be set to adjust this. I would recommend using Rally if you have a lab or something; you can actually characterize how your workload affects your database connections. So you can use Rally together with a connection monitor, a top-like program that displays in real time all the connections from each of your services, so you can see how many are being used. An additional recommendation, which oslo.db is already doing, is changing from the MySQL-Python library to PyMySQL, because PyMySQL allows monkey patching, so it works better with green threads.

One basic thing we all want in Cinder is that when you stop a service, it is done cleanly, and you expect it to be done cleanly. Usually this entails either just stopping the service, or disabling and then stopping it, and you expect this to work. Say we have a cloud that is under heavy load, Cinder is getting a lot of requests, and we want to stop the services; let's see how stopping each of the services goes. The scheduler: you issue the systemctl stop, you look at the resources, and everything is fine. This should also work on the API, since it works on the scheduler. But no, you run it and it doesn't work: you see resources stuck in attaching and detaching status because the operation failed. So what happened?
What happened is that systemd, when you request the stop, sends the terminate signal to all the processes within that service, and that is not what should be happening. We need to send it only to the parent process, which will propagate it to its children properly, so we just need to change the service unit to set the KillMode accordingly. We try again and we still fail, with more errors, although fewer than before, and we see failures only when an operation takes longer than 60 seconds. This is because we didn't know about the database connection contention, so we had an RPC timeout set to 80 seconds, while the Cinder service has a graceful shutdown timeout of 60 seconds: if after 60 seconds it hasn't completed, it sends an alarm signal to all the processes and stops them. Recommendation: always set your graceful shutdown timeout greater than the RPC response timeout, so you don't get into this situation. So with all this knowledge we try again, and it still won't stop properly. In this case it is an eventlet wsgi issue that for some reason leaves some threads idle in there. This is not a big problem, because all operations will have already been completed. If you leave the systemd stop timeout at its default of 90 seconds, and you usually work with an RPC timeout of 60 seconds, this will not happen; but if you increase the graceful shutdown timeout, you have to keep this timeout in mind, because when you request a stop, systemd will wait 90 seconds, the default value, and if the service hasn't stopped it will escalate to a KILL signal, so the service will get killed. For the API, as long as the other values are lower it doesn't matter, because it's only idle threads that are making it not stop promptly.

cinder-volume is different, because it's not only in the control path but also in the data path, so it performs long operations like creating a volume from an image, if the image is huge, or migrating volumes. So the default graceful shutdown timeout of 60 seconds will probably not be enough, and we have two solutions: we either increase the timeout to an arbitrary number that we consider enough, or we just disable the escalation from the terminate signal to the kill signal in the service unit and check it manually. To do that, we set KillMode to none, in ExecStop we send the terminate signal ourselves, and we change the graceful shutdown timeout to a crazy number, like two days. It doesn't matter, because we are actually going to check personally whether the service has gone down: we request the stop, we wait a little while, and we check it, or we can script this check to see whether the process has actually stopped. If it hasn't, we can check with an SQL query which resources are still being worked on, we can check the logs and see if an operation is stuck, and if it is stuck, you can just force the stop at any moment. In summary, you have to check your services, you have to check your timeouts, and be a little bit more careful. I would recommend setting those relations so that the systemd stop timeout is greater than the graceful shutdown timeout, and the graceful shutdown timeout is greater than the RPC response timeout. If you change how the services stop on a running cloud, you can reload the systemd daemon, so the next time you stop them the changes take effect and they are properly stopped.
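To summarize the timeout relationship just mentioned, here is a tiny sanity-check sketch; the numbers and names are written out descriptively rather than as the exact cinder.conf or systemd unit keys.

```python
# Hedged sketch: the ordering of shutdown timeouts recommended above.
# systemd stop timeout > Cinder graceful shutdown timeout > RPC response timeout
systemd_timeout_stop_sec = 300     # how long systemd waits before SIGKILL
graceful_shutdown_timeout = 180    # how long Cinder waits for in-flight work
rpc_response_timeout = 60          # how long an RPC caller waits for a reply

assert systemd_timeout_stop_sec > graceful_shutdown_timeout > rpc_response_timeout, \
    'shutdown timeouts are ordered the wrong way; clean stops may fail'
print('timeout ordering looks sane')
```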
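And going back to the database-connection contention discussed a little earlier, here is a minimal sketch of the threads-versus-connections trade-off at the SQLAlchemy pool level. The numbers and connection URL are illustrative, not recommendations; in a real deployment these map to the corresponding database pool options in the service configuration.

```python
# Hedged sketch of the thread-vs-connection trade-off described above.
# With ~1000 green threads sharing a small pool, most threads queue up
# waiting for a connection under load; fewer workers plus a larger pool
# reduces that contention.  Values are illustrative only.
from sqlalchemy import create_engine

engine = create_engine(
    'mysql+pymysql://cinder:secret@controller/cinder',  # PyMySQL, so it can be monkey-patched
    pool_size=30,      # steady-state connections kept open
    max_overflow=30,   # extra connections allowed under bursts
    pool_timeout=10,   # seconds a thread waits for a free connection
)
# create_engine is lazy; connections are only opened when the service
# (assumed reachable here) starts issuing queries.
```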
Now let's go into another thing that happens a lot of the time: you have your cloud running with the logs at info level, and suddenly you have an issue and you wish they were at debug level, so you need to change them, but to change them you need to restart the service. And as we saw, this is not a trivial matter, and in the case of cinder-volume, if you want to stop it cleanly, it will probably disrupt your whole service. So what we would like is a way to change the log level in less than a second, let's say, with as little disruption as possible. I'm going to explain how to do it with the GNU debugger. It's a kind of hacky way, but it works: you ask GDB to run a Python script that attaches to the running process, which stops it for a little while, like half a second or even less, you run some Python code inside it that changes the log level, and you detach from the process. This is not the nicest way to do it, but if the alternative is stopping your service for, let's say, half an hour just to cleanly stop it so all operations finish in cinder-volume, you may want to consider it. I don't use it, but I found it fun doing it. It can be improved; for example, you can set your overcloud to use gdbserver, so you don't have to install the whole GDB there. OK, and... we should probably skip this and go straight to the Q&A, because we have, like, four minutes. All right, well, this is an easy one, this is just how Cinder detects that a service is down: basically the services report in every 10 seconds, and if they haven't reported anything after 60 seconds they are considered down. We can make detection faster, but we should be careful with transient database errors and the increased load on the database we may be generating. So, thank you, and if there are any questions, please, the microphone is there. I just need to show the slides the legal guys told me to show.

Can you explain how replication works when you have a volume on the primary site? Does it only replicate changes to the secondary volume, or does it replicate the entire volume? How does it work? Well, replication actually depends on the backend. The way it works is that you have to configure your backend on your own; Cinder doesn't take care of configuring and setting up the replication and the secondaries. In the case of Ceph, for example, you have to peer the secondaries with the primary, make sure that journaling is enabled, and do some configuration so that volumes get automatically replicated if the correct parameters are passed from Cinder. So all Cinder does is, when it receives a request from the scheduler for a replicated volume, it goes to that backend and tells it: okay, this volume, please add journaling and make sure it's being mirrored. I don't know if that answered your question. Great. Any more questions?

I don't understand, sorry. Can you use the microphone? Yeah, I'm a little deaf, and this is not helping. Okay, I would like to ask about the HA automatic cleanup. You just talked about the cleanup that can be done by another node in your cluster, yes? So you mean on demand? No, automatic cleanup. Oh, well, yeah. My first implementation did include automatic cleanup, but I got so many minus-twos that I decided to postpone it. So basically nobody wanted it; everybody said no way. Not only did they not want to use it, they didn't even want it to be there.
So I will try to fight that battle later on, when all the other work is done. So automatic cleanup will not be done in the next release? Sorry? I mean, automatic cleanup will not be done in the next release? No, no, I don't think so. Okay, thank you. The concept of automatically cleaning up in a distributed environment and messing with the status of resources scares a lot of people, and I understand them. Yeah, I actually made it optional in that patch. By default it's disabled, but if you want it, you would say, okay, I want automatic cleanup, and you would set a timeout: don't start a cleanup immediately, give it five minutes, or ten minutes, whatever you decide. So the schedulers, since they are already checking which services are down, would check whether the proper time has passed since the last heartbeat was sent, and they would trigger the automatic cleanup. So is this implemented, and you just have to enable it? No, it's not. I implemented it, but I abandoned the patch. So it will be easy to get it back in there; the patch can be restored. But, yeah.

Just a simple question about replication again. I assume you have two Cinder backends; I'm thinking in the context of Ceph. You have two Ceph clusters, and you set up RBD mirroring between them. The user interface is that the user selects that they want this volume replicated. Well, yeah. And do they have to have quota on the second one? No. The quota is managed by Cinder, and the Cinder quota is the normal quota for the primary. For the secondary, the sysadmin has already said that they want mirroring from one to the other, and in Cinder you don't add extra quota for having it replicated. What you have is a special volume type, so if you want to charge extra for the volume type that says it's replicated, you can do it, but you don't count it as double, or triple if you have three replicas. So you have a volume type which enables the replication feature? It's at the volume type level? Yes. If you have different backends, some replicated and some not, you define your types to say explicitly whether replication is enabled. For example, one volume type is replicated, so in its extra specs you set something like replication enabled is true, and in every other one you make sure it is not set that way. That way, regardless of whether the backend does per-volume or per-backend replication, only the volumes that are meant to be replicated are sent to the replicated backend. When you're updating an existing installation, can you enable replication on existing volumes? You would need to retype the volume. You need to retype. You can set the quota for that type and then charge double. Okay. Thank you very much.
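On the volume-type answer above, here is a minimal sketch of defining a replicated volume type with python-cinderclient, so the scheduler only places volumes of that type on replication-capable backends. Endpoint, credentials, and names are illustrative.

```python
# Hedged sketch: a volume type whose extra specs ask the scheduler for a
# replication-capable backend.  Credentials and names are illustrative.
from cinderclient import client as cinder_client
from keystoneauth1 import session
from keystoneauth1.identity import v3

auth = v3.Password(auth_url='http://controller:5000/v3',
                   username='admin', password='secret', project_name='admin',
                   user_domain_id='default', project_domain_id='default')
cinder = cinder_client.Client('3', session=session.Session(auth=auth))

# Volume type that only matches backends reporting replication support;
# '<is> True' is the boolean extra-spec syntax.
vtype = cinder.volume_types.create('replicated')
vtype.set_keys({'replication_enabled': '<is> True'})

# Volumes of this type will land on a replication-capable backend.
cinder.volumes.create(size=10, name='important-data', volume_type='replicated')
```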
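And as a footnote to the GDB log-level trick mentioned near the end of the talk: the code injected into the running process is just ordinary Python. A minimal, illustrative version of such a snippet might look like this; the logger names are assumptions, and the injection mechanism itself (GDB calling into the embedded interpreter) is left out.

```python
# Hedged sketch of the snippet one might inject into a running
# cinder-volume process to flip log levels without a restart.
# Logger names are illustrative; adjust to whatever your services use.
import logging

for name in ('cinder', 'oslo_messaging', 'sqlalchemy.engine'):
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.info('log level switched to DEBUG for %s', name)
```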