Hi, okay. I think it's time to start. Thank you all for joining us today for the Masakari project update session. My name is Sampath, I'm the current PTL for the project, and Tushar, one of the core members of the team, will join me for this presentation. For those who are not familiar with Masakari, I will give a brief introduction to the project. The project provides a VM HA (virtual machine high availability) service in OpenStack: our main mission is to provide instance high availability in an OpenStack cloud, automatically recovering VMs from failures. I'm not going to go into further detail here; I have put links to the previous summit sessions, where you can find the full 40-minute presentation about what VM HA is, what problems we face, and what difficulties we are trying to address in this project. We also have a user story defined, a lot of documentation work from past summits, and a lot of community discussions from past summits; you can find those links on our wiki page as well. This is the big picture of the whole Masakari architecture, which basically consists of three projects. The first one is Masakari itself, which contains the Masakari engine, the Masakari API, and all of the business logic. Then we have masakari-monitors, which does the monitoring for failures. Currently we have four types of monitors. The first one is the host monitor, which monitors the physical host: if it detects a failure, it sends a notification to Masakari, and the Masakari engine takes care of the recovery. The second one is the process monitor, which monitors the processes that are important for the VMs, such as the iSCSI daemon; you can configure whatever processes you need to monitor. The third one is the VM monitor: we use notifications from the QEMU process to monitor the health of the VMs, so if something goes wrong on the VM side, for example an I/O error, we can detect those failures as well. The fourth kind of monitor is quite new and landed in the previous release: we use the QEMU guest agent to monitor the VM from inside, so any application can pass a notification to Masakari through the QEMU guest agent. This is quite a breakthrough for us, because since the beginning of the project we had a policy of monitoring the VM only from the outside, black-box monitoring. Since that feature landed, things have changed slightly: we now have a way to monitor the VM from the inside, and it is configurable. If you want the old black-box monitoring, you still have it, and if you need to look inside the VM, you can configure it that way and have a kind of white-box monitoring as well.
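As a rough illustration of the monitor-to-engine flow described above (not part of the talk): when a monitor detects a failure, it reports it to the Masakari API, and the engine then drives the recovery workflow. The sketch below is a hedged assumption using the openstacksdk instance-HA proxy; the cloud name, hostname, and payload values are placeholders, and field names may differ between releases.

    # Minimal sketch of what a host monitor effectively does when it detects a
    # failed compute host: post a COMPUTE_HOST notification to the Masakari API
    # so that the Masakari engine can start the recovery workflow.
    from datetime import datetime, timezone

    import openstack

    conn = openstack.connect(cloud="mycloud")  # assumed clouds.yaml entry

    notification = conn.instance_ha.create_notification(
        type="COMPUTE_HOST",                       # a physical host failure
        hostname="compute-01",                     # placeholder failed host
        generated_time=datetime.now(timezone.utc).isoformat(),
        payload={"event": "STOPPED",
                 "host_status": "NORMAL",
                 "cluster_status": "OFFLINE"},
    )
    print(notification.status)                     # e.g. "new" until picked up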
Also in the previous release we got two new subprojects. The first is the Masakari dashboard, so you now have a nice view in Horizon of the Masakari configuration and you can see the notifications in it. The second is the Ansible plug-in, so you can now easily install Masakari; only one part of it is still missing, we are working on that in this cycle, and I will explain in the next slide what we are going to do in Stein. In the last cycle we also added new features: you can now customize the recovery workflows, so for example you can put your own actions in between the steps of the VM evacuation workflow, as sketched further below; everything is documented and you can find it in the Masakari documentation. There is also the introspective instance monitoring, the fourth kind of monitor I was just talking about, where we use the QEMU guest agent to monitor inside the VM. We also added DB purge support, so you can purge the soft-deleted records, and we covered the community-wide goals and community-wide configuration work as well. For the Stein cycle we have a lot of plans. The first one is the progress notification implementation. This is about sending out notifications about what is going on inside Masakari: right now we have no way to know, because when we start a workflow it only shows us whether the workflow is running or in error. This feature will give you a good idea of what work the workflow is actually doing, and Tushar will give you more details in the next few slides. The second one is work on Ironic instances: I am going to try to support volume-backed Ironic instances and create highly available instances using Masakari; right now we are still discussing how the monitoring part and the recovery part should work. The next one: today Masakari mostly depends on Pacemaker to do STONITH and the fencing, but with this feature you may be able to do STONITH via Masakari, so you can forcefully shut down a node, or, if you find any faulty behaviour on a node, ask Masakari to do the fencing part as well. Then there is the missing Ansible part I mentioned on the last slide, which is installing masakari-monitors. We have had some difficulties working out how to install masakari-monitors together with configuring Pacemaker and Corosync and everything; we are working on it now, and hopefully this cycle we will be able to provide a full suite of Ansible playbooks to install Masakari. We are also going to deprecate the python-masakariclient CLI. We understand that we will not be able to remove the library entirely, because some of its features are used in other places, but the CLI will definitely be gone, and instead of it you can use the OpenStack client. So there will be no loss of features, but that part of the code will disappear in Stein. We are also going to improve DevStack support: I think the masakari-monitors part is missing, and we are going to add it. Also, we currently don't have any functional tests, and some of our members are working hard to add functional tests on the Masakari side and on the openstacksdk side as well. The last one: I am personally working on the OpenStack resource agents. If you are familiar with the Pacemaker resource agents, some people, Adam Spiers and others, have been maintaining these in the OpenStack Git repositories, and we have been discussing a new architecture for the OpenStack resource agents. If you are interested, you can join the self-healing SIG so we can discuss this further.
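The recovery workflows in Masakari are built on TaskFlow, so a custom step added through the workflow customization mentioned above would look roughly like the sketch below. This is an illustrative assumption, not code from the project: the class name, its argument, and the alerting behaviour are hypothetical; the documentation mentioned in the talk describes how such tasks are actually registered and at which stage of a workflow they run.

    # Hypothetical custom step for a Masakari recovery workflow (a sketch only).
    # The taskflow-based driver can run operator-supplied tasks like this one
    # alongside the built-in evacuation tasks.
    from taskflow import task


    class NotifyOperationsTeam(task.Task):
        """Example extra step: tell an external system a recovery has started."""

        def execute(self, host_name):
            # In a real task this could open a ticket, page someone, etc.
            print(f"Masakari recovery started for failed host {host_name}")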
Now I would like to ask Tushar-san to give a further introduction to the progress notification feature. Thank you. Yes, so in the Rocky release we added the Masakari dashboard, and in the Masakari dashboard you can now see all the notifications that are currently running. But from the operator's point of view the information displayed is not enough, because it only shows the status as running, and some notifications can take a lot of time to process: it all depends on how many instances are running on that particular failed compute host and whether the instances are booted from volume or from image. If the instances are booted from images, it might take a lot of time, especially if the images are not cached on the compute nodes. So there is not enough information available for the operator to tell how much time the host failure processing will take to complete. We are therefore going to emit a lot of useful information to give the operator exactly that. Before I explain it, let me quickly go through the recovery methods we support. When you create a failover segment, you can specify one of four recovery methods. The first one is auto: in this case Nova decides onto which compute host the instances should be evacuated. The second one is reserved_host: when you create a failover segment, you add all the compute hosts that will be part of that segment, and when you add a host you can mark it as a reserved host. If a compute host is a reserved host, the operator needs to make sure the compute service running on that host is disabled; only then should they add the host to the failover segment and set it as reserved. The third one is auto_priority: in this case the host failure workflow first lets Nova select the compute hosts for the evacuation, and if all the capacity is exhausted, it then tries to evacuate onto the reserved host; so first it tries the auto recovery method and then reserved_host. And the last one is rh_priority, reserved host priority, which is exactly the opposite of auto_priority: first it tries to evacuate onto the reserved host, and if not all of the instances are evacuated, it then tries the auto recovery method.
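To make the segment and reserved-host concepts above concrete, here is a hedged sketch of how an operator might set them up with the openstacksdk instance-HA proxy; the cloud name, host names, and control_attributes value are placeholders, and the exact proxy method signatures may differ between SDK releases.

    # Sketch only: create a failover segment using the reserved_host recovery
    # method, then register a normal compute host and one reserved (spare) host.
    import openstack

    conn = openstack.connect(cloud="mycloud")

    segment = conn.instance_ha.create_segment(
        name="segment-1",
        recovery_method="reserved_host",  # auto | reserved_host | auto_priority | rh_priority
        service_type="COMPUTE",
    )

    # Host that the monitors watch and whose instances get evacuated on failure.
    conn.instance_ha.create_host(
        segment.uuid,
        name="compute-01",
        type="COMPUTE",
        control_attributes="SSH",
        reserved=False,
        on_maintenance=False,
    )

    # Spare host; as noted in the talk, its nova-compute service should be
    # disabled by the operator before it is added as reserved.
    conn.instance_ha.create_host(
        segment.uuid,
        name="compute-02",
        type="COMPUTE",
        control_attributes="SSH",
        reserved=True,
        on_maintenance=False,
    )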
In this slide I'm going to explain what kind of information will be available to the operator, and in this particular case I'm taking the reserved_host recovery method as the example. Here you can see on the left-hand side that a particular host has failed, and that node has eight instances; Masakari will try to evacuate those instances onto the reserved host B. The first task of the reserved_host recovery workflow is to disable the source compute host. Since that host has failed, there is a possibility it could come back up in the meantime, and we don't want any new instances to be launched on it, so we disable the compute service running on the source compute node; that is the responsibility of the first task. The second task is to prepare the HA-enabled instances for evacuation. For this task we provide config options to decide which instances should be evacuated; there are three of them. The first is the instance metadata key HA_Enabled: as a user I can say that I want this instance to be HA enabled, so when I create the instance I set this metadata (a small sketch of this appears after this walkthrough). The second is ignoring instances in error state: if any instances on this compute node are in error state, they won't be evacuated onto the reserved compute host. And the third is evacuate all instances, which bypasses the first two options and evacuates every instance, irrespective of whether it is HA enabled or in error state. So this task just prepares the list of instances that need to be evacuated onto the reserved compute host. The third task is evacuate and confirm. In this task Masakari starts evacuating the instances in batches. Based on those three config options, here I have depicted that only instances two, three, four, five, six, and eight will be evacuated. During this evacuation, if the source compute host is part of a host aggregate, the reserved host is first added to that host aggregate; that also happens as part of the evacuate-and-confirm task. After all the instances are evacuated, the compute service of the reserved compute host is enabled. And this is the final picture the operator will see on the dashboard: all the instances that were evacuated are shown here, and if the user hovers the mouse over any of these instances, they will see the instance details. Here we are also going to add a "view instance actions" link, where you can see what actions were performed on that particular instance. This already exists in the Nova part of the dashboard; here you can see that the instance was locked, then the evacuation was in progress, and then it was unlocked, so all that information is shown. The request ID here is very important: if any instance is taking a long time to evacuate, then with this request ID the operator can go through the logs and figure out if there is any trouble. So this information from the previous slides will be available on the dashboard: once you click on any notification, you will see this graphical representation in one of the tabs of the notification. We are also going to show these event details in a verbose form, so for all three tasks you will be able to see all the information: which host was disabled; for the second task, prepare instances, how many instances will be evacuated, how many are HA enabled or not, and how many are in error state; and in the evacuate-and-confirm task you will see the current progress, so if there are 10 or 20 instances running on this compute node, you will be able to follow all those details properly. This view will give the operator an idea of what is currently happening in the background, and that is what we want to support in the Stein release.
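Here is a hedged sketch of the per-instance side of this: tagging a server with the HA_Enabled metadata key so that the prepare-instances task selects it for evacuation. The server name and cloud entry are placeholders; the other two behaviours (ignoring instances in error state, or evacuating everything regardless of metadata) are engine-side configuration rather than per-instance metadata.

    # Sketch only: mark an instance as HA enabled so that Masakari's
    # host-failure workflow will evacuate it when its compute host fails.
    import openstack

    conn = openstack.connect(cloud="mycloud")

    server = conn.compute.find_server("my-critical-vm")
    conn.compute.set_server_metadata(server, HA_Enabled="True")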
Yeah. So I'll pass control back to Sampath-san. Thank you, Tushar-san. These are the final two slides. If you want to give any feedback, you can find us on IRC in #openstack-masakari; this is our official channel for discussion. You can also use the mailing list with the [masakari] prefix. We also have a weekly IRC meeting, on Thursdays at 04:00 UTC. If you would like to join the IRC meeting and find that this is an awkward time zone for you, please ask; we can change the time and find something in between that works for all of us. You can find the agenda in the wiki as well. We also have the Masakari wiki, with a lot of information about past discussions, collaboration with other projects, and so on. Especially for new contributors, there is information on how to contribute: we have two very nice documents about contributing to Masakari, covering the general OpenStack how-to-contribute guide plus some Masakari-specific things. You can follow the documents, or just ask us and start a conversation; we would love to help you out. And it's not only about developing Masakari: if you want to use it and integrate it into your cloud, just come to us and ask any question, we would love to help. The next thing I would like to introduce is the self-healing SIG. This is not only about Masakari; it's about having self-healing across OpenStack as a whole. We now have a lot of members from different projects discussing self-healing scenarios, so if you have any problems or questions related to self-healing, please come to the self-healing SIG. We will start a bi-weekly meeting from next week, and you can find the details on the wiki page. There are a lot of discussions there, including about Masakari, and I think you will find a lot of good information about self-healing, resiliency, and HA topics in the self-healing SIG. So finally, thank you all for participating in this session. Thank you very much. We may still have some time for questions, so please, if you have any questions.

Yeah, hi. Several questions. Okay, the first one: in your presentation you showed that you are going to implement notifications. Yeah. Does it mean that you are planning to move the interaction of the Masakari monitors with the Masakari engine to notifications instead of the API? Today, when for example a host fails, a Masakari monitor sends an API call, and then the API sends a notification to RabbitMQ and the Masakari engine processes that notification. Wouldn't it be easier for the host monitor or instance monitor to just send the notification to RabbitMQ, without this step through the API? I think we will still use the same notification mechanism, but this notification is different from the failure notification. There are two types of notifications in Masakari. The first one is what the monitors send to Masakari, including the failure details. What Tushar explained is the notification about how the failure processing is running, the state of the process. It's like what you have in Nova, where you can emit to the MQ the details of the ongoing task. So these are two different notifications, and we are not going to change the existing notification model. Yeah, so the Masakari monitors are still interacting via the API? Yes. Okay, and is there any reason not to just send them to Rabbit as well?
Yeah, so his question is whether the Masakari monitors should bypass the Masakari API and send the message directly to RabbitMQ, so that the Masakari engine would pick it up and process it. Oh, this notification? Yeah, yeah. Okay, so we don't have that feature at the moment. Yeah, but that is possible. It's possible, yeah. Why have we not done it? Because we wanted the monitors to also go through the Masakari API. Yeah. Okay, and the next question is about the Ansible plugin you were mentioning. As far as I know, the OpenStack-Ansible project already has a Masakari role, and the implementation of the configuration and setup of the Masakari monitors was merged to master, I don't know, five days before the start of the summit. So do you coordinate with that project, or are you writing something of your own, independently? For installing the Masakari API and the Masakari engine? Yes, the Masakari API, the Masakari engine, and the Masakari monitors; they are actually present in OpenStack-Ansible, they were merged for Stein, like, I don't know, maybe five days before the summit? Yeah, okay. Sorry, me neither. Is that Kolla-Ansible you're talking about? Oh, no, no, OSA, OpenStack-Ansible. Okay, I will check that out. All right, yeah, because we have had some pretty difficult issues with installing the Masakari monitors, because we also have to configure Pacemaker and Corosync. The difficult part is not the Masakari monitors themselves, the monitor is very simple, but we also need to set up Pacemaker as well, right? Oh, okay, okay. And nowadays we are trying to run Masakari on Rocky, and it seems that the Python client is broken for it. Yeah, I think we fixed it. Oh, already, yeah, okay. Oh, okay, thank you. I think the fix is on Rocky and on master. Oh, yeah, thank you. Yes, great, just ask any of us. So if there are no further questions, I would like to conclude the session here, because I think the next session is about to start. Yeah, thank you very much.