So, hello, everyone. Today we're going to talk about zero-downtime upgrades of an OpenStack cluster, featuring the Kolla project. My name is Hieu, and my colleague here is... My name is Duong. I am a software engineer from Fujitsu, and I'm also a Kolla reviewer. You can ping us; our contacts are on the screen here, email and IRC.

This talk will give some overview and thoughts about zero-downtime upgrades for OpenStack services. Please note that the idea shown in this presentation is just a concept, and we hope that in the future we can implement it natively in the OpenStack projects. We have a PoC and a demonstration via the Kolla project.

Here is our outline. First, we go through some concepts about OpenStack upgrading. Second, our approach to go from minimal downtime to zero downtime. And last, a zero-downtime upgrade proposal in the Kolla environment. We will show a live demo of upgrading in Kolla with zero downtime and compare it with the normal upgrading process, which we have in a video demo. This is a live demo, so we hope it will be okay.

The first part is about OpenStack upgrading. Upgrading is one of the most demanded features for every system, and nowadays service level agreements are stricter, so after a release we need to decrease the downtime of the system. We can achieve the upgrade via a cold upgrade, which is very easy, but then it is very hard to control the downtime.

So here come the two upgrading models. The first one is blue-green deployment and canary release, and the second one is rolling upgrade, which is implemented in some OpenStack projects. On the screen here we have two clusters. The first one is a cluster running the old versions of our services, depicted in green, and the second cluster, in blue, is running the new versions. Blue-green deployment simply switches all the requests from the old cluster to the new cluster. The disadvantage here is the high hardware cost required for switching and for preparing the infrastructure for upgrading. The canary release is an advanced variant of blue-green deployment: we switch only a few percent of user requests from the old cluster to the new cluster, and if the new versions of the services work well, we switch 100% of user requests from the old one to the new one.

Then comes the rolling upgrade. The rolling upgrade model in OpenStack tries to eliminate the need to restart all services on all hosts simultaneously. Say we have three services, A, B, and C, all running version X. While upgrading service A and service B from X to X+1, we need to ensure that A and B in the new version can still work with C in the old version, otherwise there may be problems during the upgrade.

These come from the user stories of the Product Working Group, and these are the eight requirements for performing a rolling upgrade of an OpenStack project. The first one is online schema migration, so that a supporting project can migrate its data while we upgrade the schema of the OpenStack project. The second one is maintenance mode, so that we fence the requests from users while we are upgrading the service.
The third one is live migration: if we try to rolling-upgrade the data plane, for example the Nova compute service, we need to ensure that the VMs running on the node can live-migrate to other nodes. The fourth one is multi-version interoperability, which I mentioned in the last slide: OpenStack services that communicate via RPC need to keep working while we upgrade from the old version to the new version. The fifth one is graceful shutdown: if a service being upgraded is processing user requests, we need to ensure that the current requests finish before it goes into the upgrading process. The sixth one is upgrade orchestration: we can automate the upgrade process via some orchestration tool, for example Heat. The seventh one is upgrade testing: we try to ensure that new patches uploaded to the OpenStack code work well and do not break the upgrading process. The eighth one is project tagging, which is proposed to provide enough information for operators or end users to see which OpenStack projects support rolling upgrade.

In more detail, with the upgrade assert tags, the OpenStack TC defines upgrade-related tags. The first one is supports-upgrade. The second one is supports-accessible-upgrade; almost all OpenStack projects support upgrade, but no project supports accessible upgrade yet, because it is very hard to test the data plane of an OpenStack project. The third one is supports-rolling-upgrade; a few OpenStack services, like Nova or Neutron, now support rolling upgrade. The fourth and fifth ones are supports-zero-downtime-upgrade, where we ensure that there is no downtime while upgrading but can accept a little bit of delay in the user's responses, and supports-zero-impact-upgrade, which also eliminates that delay from the user's perspective. Our talk targets the fourth one, supporting zero-downtime upgrade, but not natively in the OpenStack services.

To go from minimal downtime to a zero-downtime upgrade, we need to enhance two things. The first is configuration management: when a new version of an OpenStack project is released, there can be new config options, deprecated options, or options removed after two releases, and we need to pin the version in the config file for the RPC communication. The second is that we need to enhance DB migration, or in another term, online schema migration: while upgrading the database, we need to ensure that the database upgrade process does not break the still-running old OpenStack services. Currently, there are two main approaches for online schema migration. The first is trigger-based, which Keystone and Glance implemented in the last cycle; we also saw that Facebook uses something like a trigger-based approach (see the reference link) for upgrading Facebook's services. The second is triggerless, which Nova and Neutron are targeting. Another approach is binary-log-based, which we will talk about later.

So for the database online schema migration, there are two candidate solutions. The first one we propose is to buffer all the requests during the upgrading period. The second one utilizes a checkpoint and a snapshot together with the binary log of the database.
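To make the first candidate solution more concrete, here is a minimal, hypothetical Python sketch of such an in-memory request buffer: it holds requests while an upgrade is running and replays them in arrival-time order afterwards. It is not the Intermission implementation used later in the demo; the class and method names (RequestBuffer, forward, and so on) are made up for illustration.

```python
import time
from concurrent.futures import Future
from threading import Lock


class RequestBuffer:
    """Hold incoming requests while an upgrade is running and replay them,
    in arrival-time order, once the upgrade has finished, so the client
    only perceives extra latency instead of errors."""

    def __init__(self, forward):
        self.forward = forward      # callable that sends a request to the backend
        self.upgrading = False
        self.lock = Lock()
        self.pending = []           # list of (timestamp, request, future)

    def handle(self, request):
        """Called by the proxy layer for every incoming request.
        Returns a Future so the caller can wait for the real response."""
        with self.lock:
            if self.upgrading:
                fut = Future()
                self.pending.append((time.time(), request, fut))
                return fut          # resolved later by finish_upgrade()
        fut = Future()
        fut.set_result(self.forward(request))
        return fut

    def start_upgrade(self):
        with self.lock:
            self.upgrading = True

    def finish_upgrade(self):
        """Replay the buffered requests in their original (timestamp) order."""
        with self.lock:
            backlog = sorted(self.pending, key=lambda item: item[0])
            self.pending = []
            self.upgrading = False
        for _, request, fut in backlog:
            fut.set_result(self.forward(request))
```

In the demonstration later, this role is played at the HTTP layer by Intermission on top of OpenResty; a complete solution would also need an equivalent buffer for the RPC messages at the message-queue layer.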
Here is our proposal in detail. In the first step, we have a normal OpenStack cluster. When we start to upgrade the cluster, we turn on the request buffer; this can be an in-memory request buffer. It holds the users' requests and tells the users to please wait a few minutes while we are upgrading the OpenStack cluster. After upgrading our services from version X to version X+1 and the DB layer from X to X+1, we release all the requests in the buffer and ensure that they are resent in the original order. There are two kinds of requests we need to buffer: the first is the RESTful HTTP API requests, from users and between projects, and the second is the RPC requests going through the message queue layer. We also need to ensure that the requests in the buffer are resent in the correct order, based on their timestamps.

The advantage is that, from the user's point of view, there is no perceivable service downtime if the migration time is short enough. The disadvantage is that if the migration time is long, mainly because of the database migration process, some requests can time out, and the system will lag while the requests are queued in the buffer. For some of these requests, there are updates in the Keystone project that allow accepting expired tokens, so we can reduce the number of token-expiry failures when resending from the buffer.

To address the case where the buffer can become very large while upgrading and migrating a large database, we come up with the next proposal. The databases commonly used with OpenStack have a feature called the binary log. We turn on the binary log, so all transactions are recorded in it, and we create a checkpoint at the moment we turn it on. As depicted here, the new data coming in, the upper layer in red, is recorded by the binary log; we then shut down the database-related services and, reading the binary log, migrate the data to the new DB version. After that we turn off the binary log and restart all the services in the new version. There is a little bit of downtime here, which can be eliminated with the previous approach. The advantage of this proposal is that the interruption is much smaller compared with the previous approach, because we do not have to hold a large buffer of requests; the disadvantage is that, from the user's point of view, there is still a very short downtime, and the implementation is more complicated than the previous approach, because, for example, not all OpenStack services use a DB mechanism that supports the binary log. Our live demo uses the first approach.

In conclusion, we can combine the two candidate methods to get the advantages of both. In the first step we create the checkpoint and turn on the binary log, then we migrate the current DB to the new version of the schema, then we turn on the buffer to hold the user requests, then we migrate the new data recorded in the binary log to the new DB, and then we turn off the binary log and we are finished. We hope that this gives zero downtime from the user's point of view, with only a bit of lag between a user sending a request and getting the response.
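As an illustration of the combined method just described, a hedged outline of the orchestration order might look like the sketch below. The objects and helper methods (db, proxy, snapshot, apply_binlog, and so on) are hypothetical placeholders rather than real Kolla or MySQL APIs; only the ordering of the steps comes from the proposal above.

```python
def zero_downtime_upgrade(db, proxy, services, new_version):
    """Hypothetical outline of the combined method: a binary-log checkpoint
    plus a short request-buffering window. The db/proxy/services objects are
    placeholders for whatever tooling a real deployment would use."""

    # Step 1: create a checkpoint and start recording all new writes
    # in the database binary log.
    db.enable_binlog()
    checkpoint = db.snapshot()

    # Step 2: migrate the checkpointed data to the new schema while the
    # old services keep serving traffic against the old database.
    new_db = db.migrate_schema(checkpoint, new_version)

    # Step 3: turn on the request buffer; from here on, clients only
    # experience extra latency, not errors.
    proxy.start_buffering()

    # Step 4: replay the binary log so the new database catches up with
    # the writes made during the (long) schema migration.
    new_db.apply_binlog(db.read_binlog(since=checkpoint))

    # Step 5: restart the services on the new version, stop the binary
    # log, and release the buffered requests in their original order.
    for service in services:
        service.restart(version=new_version, database=new_db)
    db.disable_binlog()
    proxy.release_buffer()
```

The point of this ordering is that the expensive schema migration happens before the buffering window opens, so the buffer only has to cover the short binary-log replay and the service restarts.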
So here comes the demonstration in Kolla, and my colleague will show you. Thank you, everybody.

First, I have some words about the Kolla project. It is in the Big Tent of OpenStack, and its mission is to provide production-ready containers and deployment tools for operating OpenStack clusters. I do not see any Kolla colleagues here, so thank you for bringing Kolla to OpenStack; it's a very nice project and a very nice team. Currently we have three deliverables. The first one is kolla, which provides a Docker image for every OpenStack service; currently almost all services in the Big Tent have an image in Kolla. The second one is kolla-ansible, the Ansible playbooks and deployment tool to deploy an OpenStack cluster. The last one is kolla-kubernetes; it has the same mission as kolla-ansible, but the deployment is done with Kubernetes and Helm. kolla-ansible is mature, but kolla-kubernetes is still in active development, so we have some companies using kolla-ansible in production.

As presented previously, we need to enhance two things for zero-downtime upgrade. The first one is configuration management: kolla-ansible supports configuration management by implementing a configuration-override mechanism, which uses features of Ansible and some of our own scripts. kolla-kubernetes has the potential to automate configuration management with the ConfigMaps of Kubernetes and Helm, but it is in active development, so it does not have the complete feature yet. The second is online schema migration: for OpenStack services that support online schema migration natively, we must implement the mechanism in the Kolla logic, and there is a blueprint for this, the one you can see here, about online service upgrade. For the projects that do not support OSM natively, we can implement the idea above at the high-availability and load-balancer layer.

For the request buffer in the demonstration, we use Intermission with OpenResty. Maybe not many people have heard of OpenResty or Intermission in the web-framework world, but OpenResty is an extension of Nginx, and Intermission is a Lua plugin for OpenResty for buffering requests, and it is open source. For the PoC we have some components here: you can see we put Intermission in front of HAProxy. HAProxy is part of the Kolla deployment model, and Keystone is only an example here; in a Kolla deployment, HAProxy is in front of the OpenStack services, and Intermission is here for buffering the requests while the upgrade is taking place. We also added a couple of small pieces: the first is a header for the requests going out of Intermission, and the second is a change to mark which part of a service is being upgraded.

We also have one scenario: continuously send 200 requests that create and delete networks, and upgrade the cluster to the master branch while the requests are being sent. For the demonstration we wrote some scripts, and you can get them on GitHub. We also recorded a video of the upgrade with downtime, the non-buffering case.

We have four panes here. The first column is where we run the script, and the second column is where we tail the logs, the output log and the error log of the script that sends the requests. I'm sorry the font size is a bit small. You can see that before the upgrade takes place the output is written out fine. Because we cannot bring up two clusters, one for the non-buffering version and one for the buffering version, we recorded the non-buffering version and show it in the video, and next we will try a live demo of the buffering version. In the last pane here is the error log from the OpenStack cluster. We are upgrading Neutron from Newton to master, so currently there are no errors. Can you skip ahead?
Maybe I will skip a little bit. It's still fine, but you can see that the Neutron server is in upgrade, so it goes down for a while, and when requests are sent to Neutron there are errors here in the bottom-right pane; many errors come back.

Now on to the live demo... Something wrong? Sorry, the VPN connection has timed out. I must use a VPN to connect to my cluster, and the connection has timed out, so please open the video for that. In the meantime I will play the second video, and if the connection comes back I will provide the live demo. The panes are the same as in the previous video. I am sorry for the connection, it's gone. Let me talk a little bit: in this demonstration we only buffer the HTTP requests; buffering the RPC requests is still part of our research.

Okay, so the Kolla community also shows a lot of interest in rolling upgrade and zero-downtime upgrade. Here is a test environment that is deployed by Kolla, and we put a new layer, OpenResty with Intermission, in front. Before upgrading the OpenStack cluster deployed by Kolla from Newton to the master version, we turn on request buffering, then start upgrading the cluster while continually sending create-network and delete-network requests to the Neutron API server. In the upper right here is the response from Neutron, and similarly to the previous scenario, in the lower right here is the error log from the Neutron API server, so if there are errors they will show up here, mostly HTTP codes above 400 and below 500.

Is the connection back? It seems that our remote connection does not like the VPN; I don't know why it does not like the VPN. Okay, please bring up the video; we cannot connect to our system via the VPN connection on this network, so please bring up the video.

You can see that the Neutron server has finished upgrading. The font is a bit small here, but in the top-left pane there are three yellow lines, and they show that the Neutron server has finished its upgrade. You can see that in the bottom-right pane no errors show up, and the output pane is still fine, but it lags, as he said before, because the requests are held in the buffer while they are continuously being sent.

Can you skip to the last part? Sure, it is a very, very long progress here. The upgrade of the Neutron server from the old version to the new version takes very long because we need to wait for the Neutron server in the newest version to come up, which takes around 5 to 10 minutes. That is the case when our Neutron database already has a lot of data in it, and it takes that much time; in production, when the database holds much more data, the migration process takes much longer than this, so if we did not have the buffer layer, the requests would fail.

Skipping a little bit here: the upgrading progress is finished in the first pane, and the requests continue to be processed; you can see the output coming back here, and no errors. Can you skip ahead? So the output continues. In the top-left panel you can see that the upgrading process is finished, and the line in the lower-left panel shows that the script has finished sending the requests for creating and deleting the networks. After the upgrading process finishes, the buffer releases the requests that were being buffered, and here you cannot see any error from the requests to the Neutron API server. Can you continue? You can see that all the requests are sent, and the output is near the end.
In the output we indexed the requests, so you can see here the requests creating and deleting network number 1, 2, 3, and then in the top-right panel you can see the sequence numbers of the buffered requests continuously increasing from 1 to 200. So that is finished.

Very sorry for the VPN connection, but coming back to this slide: we have the video demo here, and sorry that we could not do a live demo because of the timeout, but you can try it by yourself. We provide the scripts we used in the demonstration, and even the setup of the VM for testing is put here; I got the idea from the Kolla project for setting up the VM. In our demonstration we use Ubuntu for the host OS, and you can use our scripts to set up the Kolla deployment; everything else is provided by Kolla.

So that's our presentation. Do you have any questions? If there are no questions, okay, thank you for coming here, and very sorry that we could not have a live demo. Thank you, thank you very much.