Hello, Stackers, and thank you for coming to this session. In this session, we are going to talk about the operation of an OpenStack-based cloud service, which we have been providing to customers for more than a year and a half. In particular, we will focus on handling multiple alerts and on the automation of operations. The presentation title is "How to Manage Multi-Location, Multi-Component Cloud Services: Take Control of Your Alerts." The presenters are Shohei Okada, myself, and Takashi Tazoe.

First, let us introduce ourselves. Hello, my name is Takashi Tazoe. I am a system administrator at NTT Communications in Japan, and I have been with this company for almost four years. I have been engaged in the OpenStack cloud project for almost one year, working on virtual and physical server and network management, and also on operational automation. And I am Shohei Okada. I have been working on this project for more than two years, and I am in charge of virtual server, bare metal, and storage operations. I am a server operator.

So let us go on to today's session. Here is the agenda. First, we will introduce our company and our cloud service. After that, we will explain the basic operation flow and its problems, and then talk about the measures for alert control and auto remediation. After that, we will give a demonstration and finish with the conclusion and future work.

Let us introduce our company and our cloud service. Our company, NTT Communications, is a group company of NTT. NTT is headquartered in Tokyo and is one of the biggest Japanese telecommunications companies. NTT Communications itself is also headquartered in Tokyo. We have offices in more than 40 countries and regions, and more than 20,000 people work there. We have several services, such as global cloud, data centers, VPN, a global Tier 1 internet backbone, and so on. In this session, we are going to focus on the global cloud, which is based on OpenStack.

So let us introduce our global cloud, Enterprise Cloud. It is based on OpenStack and is provided worldwide. We have eight regions all over the world, such as the United States, the United Kingdom, Singapore, Germany, Hong Kong, Australia, and Japan. We have two operation centers, one in Tokyo and one in Mumbai. Global affiliates, such as NTT America, NTT Europe, and so on, work with customers, handling customer tickets and answering inquiries from them. Our cloud is connected to our colocation service, called Nexcenter, and it is also connected to third-party clouds, such as AWS and Microsoft Azure.

For better understanding, here is the product image of our Enterprise Cloud. We provide virtual servers, bare metal, storage, network, and so on in Enterprise Cloud. It is connected to the colocation service and also to third-party clouds such as AWS and Azure. Customers can manage and check resource usage and cost from the cloud management portal that we provide.

So let me go on to the basic operation flow and its problems. Here is the very basic operation flow. First, we detect alerts in our integrated monitoring system. After that, we create an incident ticket, conduct root cause analysis, and conduct remediation processes.
So I will pick up these steps, explain them in detail, and point out some problems.

In the alert detection step, we use an integrated monitoring system, and the monitored devices are connected to each other. The problem is that we may receive alerts from multiple devices when one interconnected device goes down. For further explanation, I will use this slide. In the case of maintenance, suppose that server C in this slide is shut down or in a down state. We will then get alerts not only from server C, of course, but from devices A, B, D, and E as well. In the case of maintenance it is easy to identify the target device, because it is predicted. But in the case of an outage or failure, it is much harder to identify the root cause, because we are just receiving multiple alerts from multiple devices. We need to take some time to identify the root cause and check various statuses, and only after that do we find out that server C is down. That is the problem in this phase.

Let us go back to the operation flow again and look at the create-incident-ticket step. We create incident tickets with our own ticket system, which can also be used for customer communication. We manually write the alert information and the work log of what we have done into these tickets, and update them for the customer or internally. After that, we go on to root cause analysis: we follow the runbook, check statuses, run jobs or scripts, and sometimes conduct tests such as VM creation. After that, we go to remediation, again following the runbook. Restarting a process, rebooting a server, VM migration, or hardware replacement is included in this step.

Let me give one example for a better understanding of this operation flow. Suppose that one of the hypervisors gets a hardware error, such as a memory error. In that case, we will receive a sign of failure in the integrated monitoring system. After that, we create an incident ticket manually and record the alert information and work log there. Then we check the hypervisor status and confirm the issue. After that, if possible, we live-migrate the VMs on that server to other, healthy servers. Then, once the hypervisor has no VMs left on it, we shut down the server and go on to hardware replacement.

Here is the problem: we need to take several steps, four steps, before we conduct remediation such as live migration. If the operator is working on some other incident or outage and the troubleshooting is delayed, during that time there is a risk that the hypervisor with customer VMs goes down. If a hypervisor with customer VMs goes down, VM HA will of course kick in and the VMs will be migrated to other servers, but that is accompanied by a VM reboot, which means there is customer impact. This is one of the problems, because we need to take these actions manually.
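To make the manual create-incident-ticket step above concrete, here is a minimal sketch of turning a raw monitoring alert into a ticket. Our actual ticket system is internal, so the endpoint, the fields, and the shape of the alert below are hypothetical placeholders, not our real API.

```python
# A sketch only: create an incident ticket from a monitoring alert.
# The ticket API endpoint and payload fields are invented for illustration.
import requests

TICKET_API = "https://ticket.example.com/api/incidents"  # placeholder URL


def open_incident(alert):
    """Create an incident ticket from a monitoring alert and return its ID."""
    payload = {
        "title": f"[{alert['severity']}] {alert['device']}: {alert['message']}",
        "description": (
            f"Detected at {alert['timestamp']}\n"
            f"Device: {alert['device']}\n"
            f"Alert:  {alert['message']}\n"
            "Work log:\n- confirmed sign of hardware failure on the hypervisor\n"
        ),
    }
    resp = requests.post(TICKET_API, json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()["id"]


# Example: a memory-error alert coming from the integrated monitoring system
alert = {
    "timestamp": "2017-11-06T10:15:00Z",
    "device": "hv-001",
    "severity": "warning",
    "message": "correctable memory errors exceeded threshold",
}
ticket_id = open_incident(alert)
```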
For these problems, from the next slides, Tazoe-san is going to talk about the solutions. Yes, thank you. Until now, we have talked about the basic operation flow and the problems we have faced. From this slide, I will talk about the measures we have taken for these problems. The first one is measures for maintenance-related alerts.

The main problem in this chapter is the numerous alerts caused by frequent code updates, deployments, device maintenance, and other kinds of operations running in production in parallel. Recently, development tools have improved very much, and the number of code updates and deployments has become larger and larger. That is a very good thing for agile development, but from the operational side it means that it becomes much more complicated to connect maintenance information with alerts. This complexity costs the operator a lot of time to find out whether the alerts are maintenance-related or real, whether remediation is required or not, and whether the system status is normal or not. So we needed to take action for these problems.

For these problems, we prepared a function that hides maintenance-related alerts by analyzing configuration information and correlating maintenance with alerts. Let me explain it with this slide. First, we gather configuration information onto one server. This configuration information includes device connection information, such as "device A port 2 is connected to device C port 1", and so on. In addition, we gather the hypervisor-VM relations, the component-service relations, and other information required to analyze the correlation. When maintenance is scheduled, the operator inputs the maintenance information into this function. The information contains the maintenance target, the maintenance duration, and what kind of operation will be done, for example a device reboot, a process restart, a link connection change, a configuration change, or something like that. When the input is complete, this function starts to analyze the configuration information and creates outage-related alert patterns as output. This output works as a filter over our integrated monitoring system: when alerts matching these patterns come into the integrated monitoring system, they are hidden from the operator's view by this function. In this way, operators can concentrate only on the real alerts and on their remediation.

The second measure is for multiple alerts from an outage. The problem itself is almost the same as the previous one: when an outage happens, multiple alerts are raised from the outage device and from the surrounding devices too, and that makes operation complex. So we need to take action for that. The difference between the previous case and this one is predictability. For maintenance, the operator knows in advance when and where it will happen; for an outage, we cannot know. So we need a function that, when an outage happens, analyzes the correlation between the outage and the alerts. For this problem, we set trigger alerts and predefined analysis patterns for typical outages such as a device going down. Let me explain it with this slide. When device C in this slide goes down, a ping NG alert and other alerts from that device, and link-down alerts from the surrounding devices connected to device C, will be detected in our monitoring system. We define trigger alerts, such as ping NG for a server-down outage. So in this case, the ping NG from device C is the trigger alert of this outage. When the trigger alert comes into our integrated monitoring system, this function detects it and starts to analyze the configuration. The function then creates outage-related alert patterns as output, the same as in the previous case. When the calculation is complete, the function marks those alerts as low priority. In this way, the trigger alert is highlighted, and the operator can concentrate only on remediating the trigger alert.
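Here is a minimal sketch of the correlation idea described above, assuming the configuration information has been gathered into a simple connection map. The device names, ports, and alert types are invented for illustration; the real function also uses hypervisor-VM and component-service relations, and feeds the resulting patterns into the integrated monitoring system as a filter (for maintenance) or as a low-priority marker (for outages).

```python
# A sketch only: connection data gathered from configuration information.
# device -> list of (local_port, peer_device, peer_port); names are made up.
from datetime import datetime, timedelta

connections = {
    "device-C": [
        ("port1", "device-A", "port2"),
        ("port2", "device-B", "port3"),
        ("port3", "device-D", "port1"),
        ("port4", "device-E", "port1"),
    ],
}


def expected_alert_patterns(target, start, duration_minutes):
    """Alert patterns a maintenance or outage on `target` is expected to
    produce: alerts from the device itself plus link-down alerts from the
    directly connected devices, limited to a time window."""
    window = (start, start + timedelta(minutes=duration_minutes))
    patterns = [{"device": target, "type": "ping NG", "window": window}]
    for _local_port, peer, peer_port in connections.get(target, []):
        patterns.append(
            {"device": peer, "type": f"link down {peer_port}", "window": window}
        )
    return patterns


def matches(alert, patterns):
    """True if an incoming alert matches one of the expected patterns, so it
    can be hidden (maintenance) or marked low priority (outage)."""
    return any(
        alert["device"] == p["device"]
        and alert["type"] == p["type"]
        and p["window"][0] <= alert["time"] <= p["window"][1]
        for p in patterns
    )


# Example: patterns for a one-hour maintenance window on device-C
patterns = expected_alert_patterns("device-C", datetime(2017, 11, 6, 2, 0), 60)
```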
I have prepared a demonstration of this function, so let me show you. This is our... oops, sorry for that, and sorry to keep you waiting. OK, it works now. Currently, I have shut down a device as an outage simulation. When the shutdown is complete, multiple alerts are detected in our monitoring system, like this. These alerts are all caused by one outage. At this point, it is hard for operators to judge what the root cause is and what should be done. In the background, this function starts to calculate the related alert patterns. When the calculation is complete, the function marks them in green, which means low priority in our operational rules, like this. At this point, operators know what the root cause is: the root cause is highlighted in pink, and the others are green, so they can focus on remediating that root cause. In addition, in the background this function has also created an incident ticket that reports this outage, and the ticket number is written into the operator note of each alert. Without this function, when an outage happens, the operator needs to find out what the root cause is, what the related alerts are and how they are related, create an incident ticket, and write the ticket number back to each alert. With this function, the operator knows the root cause, the ticket is created automatically, and the ticket number is attached to each alert. So the operator can focus only on remediation, start remediating immediately, and record the operation log in the incident ticket.

The third measure is an approach to auto remediation. The previous two measures mainly focused on alert control to make manual operation simple; this one is an approach to automation itself. First, I would like to talk about our basic policy when we consider auto remediation. When an operation is simple enough, we consider auto remediation first, especially for major outages, meaning frequent outages or outages with a large impact. On the other hand, when an operation is too complicated or too risky for automation, we consider documentation, that is, creating a runbook, or creating a support tool for manual operation; database corruption is an example of that. When such an operation becomes simple enough through this process, or its impact becomes larger due to a change in the situation, it becomes the next candidate for auto remediation. This is our basic policy for auto remediation.

I have prepared an auto remediation demonstration for today. Before that, I would like to introduce the overall view of the demo. In the demo, a minor hypervisor error, which has no service impact, is detected. This alert is a sign of a coming major outage that would cause customer impact on the VMs on that hypervisor, so we need to avoid it. To avoid it, our auto remediation function live-migrates all the VMs on the failing hypervisor to a healthy hypervisor, as preventive maintenance. This kind of preventive maintenance is very important for providing a highly available service, so this kind of operation should be done quickly and correctly.
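As an illustration of the live-migration step of this preventive maintenance, here is a minimal sketch using the openstacksdk client. It assumes admin credentials in a clouds.yaml entry named "ops"; the hypervisor host names are placeholders, parameter details such as block_migration depend on the Nova microversion, and the real function is driven by the monitoring alert rather than called by hand.

```python
# A sketch only, not our production code.
import openstack

conn = openstack.connect(cloud="ops")  # assumes an "ops" entry in clouds.yaml


def evacuate_by_live_migration(bad_host, good_host):
    """Live-migrate every VM off a degraded hypervisor to a healthy one,
    waiting for each VM to come back ACTIVE before moving to the next."""
    vms = list(conn.compute.servers(all_projects=True, host=bad_host))
    for vm in vms:
        # "auto" lets Nova choose block vs. shared-storage migration
        conn.compute.live_migrate_server(vm, host=good_host, block_migration="auto")
        conn.compute.wait_for_server(vm, status="ACTIVE", wait=600)
    return [vm.id for vm in vms]


migrated = evacuate_by_live_migration("hv-outage-01", "hv-normal-01")
```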
So let me show you the demonstration; I hope it works well. This is our monitoring system, cleaned up, and the ticket system is also cleaned up. This is the current status of the hypervisors: this hypervisor has three VMs on it. I create a file to simulate a minor failure. When the minor failure happens, the auto remediation function goes into the pre-confirmation phase: it checks the current status of the failing hypervisor and finds a healthy hypervisor as the live migration destination. This is the status of each hypervisor: the upper half is the failing hypervisor, and the bottom one is the healthy hypervisor. We are still in the pre-confirmation phase, so nothing has happened yet. When the pre-confirmation phase finishes, the function goes into the migration phase. In the migration phase, the VMs on the upper side change to the migrating status and move to the bottom with active status, one by one. It takes several seconds to several minutes; the time for a live migration depends on the VM size and usage. So wait a second. Now the second one becomes migrating and moves to the bottom in a few seconds, like this. And the last one moves to the bottom, to the healthy hypervisor. At this point, all the VMs on the failing hypervisor have moved to the healthy hypervisor automatically. When the live migration is complete, the function moves on to the post-confirmation phase: it checks that all the VMs have really moved off the failing hypervisor, that all the VMs on the healthy hypervisor are working correctly, and so on. In addition, as in the previous demonstration, an incident ticket that reports this incident is also created. The previous demonstration did not perform any operation on production servers, but this one did, so when the post-confirmation phase is complete, the function attaches the whole operation log to the incident ticket. I skipped part of the demonstration and tried to show you the result, but it is not working right now. In the demonstration, however, the operation log is attached to the incident ticket. With this log, the operator can understand the hypervisor situation before the operation, the operation itself, and the hypervisor situation afterwards, and can take further action based on that if any is required.
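To make the post-confirmation phase from this demo concrete, here is a minimal sketch that reuses the `conn` connection and the migrated-VM list from the previous sketch: it checks that the failing hypervisor is empty and that every migrated VM is ACTIVE on the destination, and builds the operation log that would be attached to the incident ticket. The host names are placeholders, and the ticket attachment itself is left out because our ticket system is internal.

```python
# A sketch only; reuses `conn` and `migrated` from the previous example.
def post_confirmation(bad_host, good_host, migrated_ids):
    """Verify the result of the live migration and build an operation log."""
    leftovers = list(conn.compute.servers(all_projects=True, host=bad_host))
    if leftovers:
        raise RuntimeError(f"VMs still on {bad_host}: {[s.id for s in leftovers]}")

    log_lines = [f"post-confirmation: {bad_host} is empty"]
    for vm_id in migrated_ids:
        vm = conn.compute.get_server(vm_id)
        if vm.status != "ACTIVE":
            raise RuntimeError(f"{vm_id} is {vm.status} on {good_host}")
        log_lines.append(f"{vm_id} is ACTIVE on {good_host}")
    return "\n".join(log_lines)  # attached to the incident ticket afterwards


operation_log = post_confirmation("hv-outage-01", "hv-normal-01", migrated)
```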
So that is all for the problems we have faced and the measures we have taken. Now I would like to conclude our presentation, together with future work.

Conclusion: we encountered too many alerts in production, due to the large scale and complexity of the cloud service. It is difficult to monitor such a large and complicated system with a single monitoring system, so we use several kinds of monitoring: at the service level, for example, Jenkins and our own scripts; at the resource and process level, Zabbix; and at the hardware level, the vendors' default functions. In this situation, it is difficult to keep the monitored items disjoint, so we gather these alerts into one system and correlate them there. When considering the correlation, we mainly focused on maintenance-related alerts and outage alerts. In addition, we automated some kinds of operations.

Future work: there is a lot of work to do for more sophisticated operation. The first thing is to expand the remediation patterns while keeping the monitoring system itself simple. To do that, we need to prepare some kind of framework to gain the maximum benefit with minimal effort, and also try to keep the monitoring system itself simple, for example by keeping the monitored items disjoint and the monitoring methods simple.

The second thing is automation of monitoring settings. Currently we set monitoring thresholds by hand, but it is difficult to decide an appropriate threshold: one second is OK, two seconds is OK, but three seconds is NG, or something like that. An inappropriate threshold may cause too many false alerts or, on the other hand, undetected outages. Furthermore, the number of deployments has recently become larger and larger, so the system specification changes day by day. So it is required to automate this in order to catch up with the changing situation.

The third thing is preparing enough troubleshooting functions. In today's presentation, all the measures were predefined workflows for known problems. However, there are a lot of unknown issues in production, so we need to prepare tools that simplify root cause analysis, for example a log collection function for simple troubleshooting by hand.

We will keep working on these kinds of topics for more sophisticated operation. That is all for our presentation. Thanks for listening. We would like to move on to the Q&A session if anyone has a question. Thank you for listening.