this time. So let's get started. Thank you for coming and joining the presentation. During this presentation, we want to share a lot of important information and the tools we developed through a year of operation. On behalf of our group, we have three presenters. My name is Ken Garash, and I'm leading the project. Here are also Asako Ishigaki from NTT Software and Akihiro Motoki from NEC.

This is the timeline of our project. As you can see, the project began in June 2014. At that time we had about eight people, not fully dedicated, but eight people in total, and we didn't have any knowledge about operating OpenStack. That's why we started with 100 servers and did many scalability tests. Then the team grew, and we did many more tests, such as recovery tests. We now have 14 people, and since this May we have been providing 24/7 support.

Our team members are highly skilled, but we still have a few rules; sometimes we call it our culture. This is important for achieving high output as a team. First, we focus on using OpenStack instead of developing OpenStack. Our headcount is very limited, so right now we concentrate on using OpenStack, not developing it. Also, human resources are our most important asset and are highly limited, so we are always thinking about reducing OPEX and promoting automation. Anything a human needs to do more than twice must be automated; this is one of the key principles of our team. We also actively adopt HA and are introducing self-healing.

These are the tools we are using. Most of the tools for deployment and operation are built with Ansible and Python, and some tools are written in shell script. This is our CI/CD environment; it is not much different from the community's. Using this CI/CD environment, we have already created more than 2,000 patches in 2015, and we have deployed more than 200 patches to the actual production environment.

Now let's talk about our operations. This is a very brief overview of our OpenStack configuration. It is not different from the architecture I presented at the Paris OpenStack Summit, so if you are interested, please check the URL. The basic ideas are these: we have more than double redundancy for hardware and at least double redundancy for software, so we support HA. The operator doesn't need to go to the data center or the operations room for just one failure; you can get enough sleep over the weekend, go to the data center or operations room on Monday, and fix it then.

These are our deployment tools. Once you purchase hardware, the operator registers the hardware information in the CMDB. The CMDB holds the hardware information, the location of the racks, and the network information as well. The interesting thing is that you can specify the role of the hardware through this CMDB. This is just the list of our Ansible playbooks, and you can pick whichever playbooks you want to assign to the hardware. In this example, you can see that we want to apply "normal node", which is the common set of playbooks for all servers, and we also want to configure this hardware as a Nova compute node that supports a special flavor called "Nova Compute Standard 1". Based on this configuration, Ansible reads those tags and creates this kind of inventory; it is called a dynamic inventory in Ansible. A minimal sketch of what such an inventory script could look like is shown below.
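The talk does not show the inventory script itself, so the following is only a sketch of how a CMDB-backed dynamic inventory could be written; the CMDB endpoint, field names, and group names are illustrative assumptions, not the actual production schema.

```python
#!/usr/bin/env python
"""Minimal sketch of a CMDB-backed Ansible dynamic inventory.

The CMDB URL, field names, and group names are illustrative
assumptions, not the production schema.
"""
import json
import sys

import requests

CMDB_URL = "http://cmdb.example.com/api/hosts"  # hypothetical endpoint


def build_inventory():
    # Each CMDB record is assumed to carry the host name, its management
    # address, and the playbook tags the operator selected,
    # e.g. ["normal_node", "nova_compute_standard_1"].
    records = requests.get(CMDB_URL, timeout=10).json()

    inventory = {"_meta": {"hostvars": {}}}
    for rec in records:
        host = rec["hostname"]
        inventory["_meta"]["hostvars"][host] = {
            "ansible_host": rec["mgmt_ip"],
            "rack": rec.get("rack"),
        }
        # Every selected playbook tag becomes an Ansible group, so the
        # matching playbooks get applied to the host automatically.
        for tag in rec.get("playbook_tags", []):
            inventory.setdefault(tag, {"hosts": []})["hosts"].append(host)
    return inventory


if __name__ == "__main__":
    # Ansible invokes dynamic inventory scripts with --list (and --host <name>).
    if len(sys.argv) > 1 and sys.argv[1] == "--list":
        json.dump(build_inventory(), sys.stdout, indent=2)
    else:
        json.dump({}, sys.stdout)
```

With something like this, running `ansible-playbook -i cmdb_inventory.py site.yml` would expose groups such as "normal_node" directly from the tags selected in the CMDB.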
In the generated inventory, you can see that "normal node" appears here, and "Nova Compute Standard 1" appears here as well. So once operators select the playbooks, those playbooks are automatically applied to the hardware. Using this CMDB, we can understand the configuration of the hardware, the history of hardware failures, and the roles of the hardware.

We also actively develop many tools. For deployment, we have playbooks for network configuration, accounts, logging, services, and drivers; in total we have 37 playbooks. This is an example of the most difficult one. In this setup, we need to compile a disk driver. In this playbook, we first go to the cloud, launch a VM, and install the kernel development tools and libraries. Using that VM, we compile the disk driver and also update the firmware. Finally, we install the kernel driver and create a file system. The whole complicated procedure is done by just one Ansible playbook. We also have playbooks for OpenStack itself, 62 in total, and with them we can configure HA as I mentioned before.

For operations, we are also actively developing tools. We can collect usage for billing purposes, migrate batches of VMs, take backups, and manipulate batches of users. We are also developing tools for health checks. This is an example, a tool called "per-host instance check". With this tool you specify a Nova compute node and check whether a VM boots up, whether you can take a console log, whether it gets a network and metadata, and whether you can log in to the VM via SSH. All of these things are tested. So if we make a modification to a Nova compute node, we run this tool and check whether the node is configured correctly. Because we have this tool, we can update our environment very aggressively. If you want to know more about our operating tools, we have another talk at 4:40, in this room, so please come.

Next, I want to talk about our monitoring system. We are using Zabbix for detecting real-time alerts, and we are analyzing all the logs with Elasticsearch. That is for detecting potential bugs, and we also use it to detect malicious activities.

Let me talk about the Zabbix configuration. Right now we are monitoring about 2,000 items for general things like memory, CPU, network, and hard disk usage, and we have simple self-healing mechanisms such as process restarts. For OpenStack, we have about 4,000 items being monitored, and about 65 self-healing mechanisms have already been deployed.

Let me pick two examples. The first one is RabbitMQ. This is our configuration: we are using a three-node cluster, and we handle partitioning by setting autoheal. In this setup, Zabbix keeps watching whether there is a split-brain by using these parameters, and we also keep checking the port and the process. We need to keep at least one node running, so Zabbix checks whether there is at least one active node with the API check and the process check; we check the numbers.

The second one is MySQL. This is a little bit more difficult. Anyway, this is our setup: there are four nodes here and one arbitrator, and all the read and write traffic goes directly to just one node through the load balancer. In addition to the usual monitoring items, we monitor these cluster-specific items; a monitoring sketch is shown below.
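The exact item names are on the slide rather than in the talk, so the following is only an illustration of how such Galera-specific items could be collected and pushed to Zabbix; the wsrep variable selection, item keys, and credentials are assumptions, not the production configuration.

```python
#!/usr/bin/env python
"""Sketch: push Galera wsrep status counters to Zabbix.

The monitored variables, Zabbix item keys, and credentials are
illustrative assumptions only.
"""
import subprocess

import pymysql

ZABBIX_SERVER = "zabbix.example.com"   # hypothetical
MONITORED_HOST = "db01"                # hypothetical

# Galera status variables that reveal replication-queue congestion.
WSREP_ITEMS = [
    "wsrep_local_recv_queue",
    "wsrep_local_send_queue",
    "wsrep_flow_control_paused",
    "wsrep_cluster_size",
    "wsrep_ready",
]


def fetch_wsrep_status(conn):
    """Read all wsrep_* status variables from the local node."""
    status = {}
    with conn.cursor() as cur:
        cur.execute("SHOW GLOBAL STATUS LIKE 'wsrep_%'")
        for name, value in cur.fetchall():
            status[name] = value
    return status


def send_to_zabbix(key, value):
    # zabbix_sender pushes one value per call; a real script would batch
    # values with --input-file instead.
    subprocess.run(
        ["zabbix_sender", "-z", ZABBIX_SERVER, "-s", MONITORED_HOST,
         "-k", "galera[{}]".format(key), "-o", str(value)],
        check=True,
    )


if __name__ == "__main__":
    conn = pymysql.connect(host="127.0.0.1", user="monitor", password="secret")
    try:
        status = fetch_wsrep_status(conn)
        for item in WSREP_ITEMS:
            if item in status:
                send_to_zabbix(item, status[item])
    finally:
        conn.close()
```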
To explain why we watch those items, I need to talk about how a MySQL cluster works. If you configure a MySQL cluster, a commit flows like this: it goes into the send queue of the master node and is then sent to the slave node. Once the slave node receives the commit, it returns OK, and when the master node gets this OK, it finally returns OK to the client. The slave then consumes the commit from its receive queue and applies it to disk. That is the basic principle of a MySQL cluster.

And here is the problem. If something goes wrong between the queue and the disk, the MySQL cluster sometimes freezes. We tested this, and a disk failure is not a problem; the slave node is automatically removed from the cluster. The problem happens when the disk speed is throttled for some reason. In that case the slave seems to be working correctly, but its write speed is very slow, so all the commit messages get congested in the send queue and the receive queue. Once the receive queue is congested, there is no acknowledgement from the slave, but the master still believes the slave is working correctly. In that situation you cannot write anything to the database anymore, and that creates a problem.

This is our dashboard when MySQL locked up. You can see that you get almost every alert on this dashboard. But don't panic. Here is the reason: if you panic and do something wrong, the problem becomes worse. In this case, if operators restart the OVS agents, or in the worst case reboot the network nodes, you will lose all connectivity to the running VMs. And if you cannot connect to the database, you cannot perform those operations anyway. So it is important to understand what we can do and what we should not do when we have a problem. That information needs to be gathered, and we gather it and put it into our knowledge base. That is very important.

Finally, this is the self-healing mechanism we introduced. The disk throttling sometimes happens when we take a backup of MySQL, so we limit which node takes the backup. We are also monitoring those queue items now, and if there is a change in those items, the node is removed from the cluster automatically. We also changed the backup method. We used to use a so-called online backup, but now we have switched to a safer method: first we take the node out of the cluster, then we lock all the tables and create the backup, and after that we return the node to the cluster and let it synchronize again. Since adopting this setup, we have not had the same problem. A sketch of what that backup flow could look like follows.
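The procedure is only described at a high level in the talk, so this sketch shows one plausible way to implement it; using wsrep_desync and mysqldump is our assumption, not necessarily the exact production tooling.

```python
#!/usr/bin/env python
"""Sketch of a 'safe' Galera backup: desync the node, lock tables,
dump, then let the node rejoin and catch up.

The use of wsrep_desync and mysqldump is an assumption about how the
described procedure could look, not the team's actual scripts.
"""
import subprocess

import pymysql


def safe_backup(host, user, password, dump_path):
    conn = pymysql.connect(host=host, user=user, password=password)
    try:
        with conn.cursor() as cur:
            # Step 1: take the node "out of the cluster" for write purposes;
            # a desynced node no longer participates in flow control.
            cur.execute("SET GLOBAL wsrep_desync = ON")
            # Step 2: lock all tables so the dump is consistent.
            cur.execute("FLUSH TABLES WITH READ LOCK")
            try:
                # Step 3: create the backup while the node is quiet.
                with open(dump_path, "wb") as out:
                    subprocess.run(
                        ["mysqldump", "--all-databases",
                         "--host={}".format(host),
                         "--user={}".format(user),
                         "--password={}".format(password)],
                        stdout=out, check=True,
                    )
            finally:
                # Step 4: release the lock and let the node resynchronize.
                cur.execute("UNLOCK TABLES")
                cur.execute("SET GLOBAL wsrep_desync = OFF")
    finally:
        conn.close()


if __name__ == "__main__":
    safe_backup("db03", "backup", "secret", "/var/backups/galera.sql")
```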
Now let me switch presenters and talk about our log analytics.

We have four purposes for log analytics. First, we have to detect critical system failures and recover from them immediately; we would be happy if the logs could tell us about system failures beforehand. Second, we need to detect malicious access: when users' floating IPs are accessed maliciously, we need to notify the users. Third, we need to detect non-critical errors and warnings; they are better fixed as soon as possible, and bugs might be found through those logs. Fourth, we want to identify errors and warnings that have no service impact, so that we can filter them out next time; we call them ignorable logs.

Large numbers of logs are produced by each component of the system, such as the hardware, the Linux kernel, OpenStack, and our operation tools, and it is very difficult for us to find the important messages among them. On one example day, there were 100,000 critical, error, and warning logs, but no serious log was found that day; there were only six non-critical error logs and six ignorable logs.

We analyze logs and add the results to our blacklist and whitelist. Logs found in our blacklist are sent to Zabbix. Ignorable logs are filtered out with our whitelist. The rest are shown in Kibana, and we operators analyze them: we add critical logs to the blacklist and ignorable logs to the whitelist. The Kibana dashboard is very useful for our log analysis, so the whitelist keeps growing and the number of logs to be analyzed has been greatly reduced.

Now let me explain our architecture for log processing with the blacklist and whitelist. Fluentd on every node sends logs to the log servers. Some devices that cannot have Fluentd installed send logs to the log servers using rsyslog. The blacklist and whitelist rules are contained in the Fluentd configuration. Fluentd sends serious logs to Zabbix following the blacklist, and it raises a flag on ignorable logs following the whitelist. Fluentd also puts metadata on the logs in order to create graphs from them. Then the logs are stored in Elasticsearch, and Kibana draws graphs by referring to the Elasticsearch records.

Here are some simplified examples. The first example indicates a hardware failure. This message is contained in our blacklist, so Fluentd sends this log to Zabbix, and an alert on Zabbix tells us about the failure immediately. The second example is an IDS log. Fluentd extracts the source IP from the message and inserts an "ids" value into the item key; Kibana makes graphs from this metadata. The third example indicates a user's operation error. Since this error doesn't impact our system, we have already added the message to the whitelist; Fluentd inserts an "ignore" value into the item key, and Kibana filters this log out of all graphs.

Let me show you some of our whitelist. The first message indicates access without any token. Health checks from the load balancers can't get tokens, so this warning continues at all times in our system. We watch the trend of response codes, so we don't need this log itself. The second message indicates that a user's request was denied due to the quota limitation. It has no impact on the system, but the log has error level; I think it should be an info log. The third message literally indicates that the hypervisor has more disk space than the Nova database expected. It occurs when instances in SHUTOFF status exist; this is a commonplace condition, so we ignore this log.

We have kept enhancing our whitelist, and as a result we have been reducing the logs to be analyzed. In other words, many meaningless error and warning logs are probably bothering other OpenStack operators as well. As you can see in these two Kibana graphs, our whitelist is very effective. One year ago, when we did dogfooding, we could not analyze all the logs; today, two or three hours are sufficient to analyze all of them.

Next, let me show you some of our blacklist. The first example indicates that there is a disk problem on a compute node; Fluentd sends this log to Zabbix at warning level. The second message indicates that Corosync needs to clean up its resources. This condition itself does not impact our system, so Fluentd sends this log to Zabbix at information level instead of warning level; we operators see this alert during weekday daytime and clean the resources up. This rule has helped us several times. The third message indicates a failure of a database backup. But we shouldn't worry about an individual failure, because backups are scheduled four times a day, so Fluentd sends this log to Zabbix at information level; if this alert continued, we would debug it.
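In production these rules live in the Fluentd configuration, but to make the flow concrete, here is a small Python sketch of the same classification logic; the example patterns, the "item" metadata values, and the Zabbix item key are illustrative assumptions rather than the actual rules.

```python
#!/usr/bin/env python
"""Sketch of the blacklist/whitelist classification described above.

In production this logic is Fluentd configuration; the patterns and
the Zabbix key below are illustrative assumptions only.
"""
import re
import subprocess

# Serious messages: forward to Zabbix so an alert fires immediately.
BLACKLIST = [
    (re.compile(r"Drive .* failed"), "warning"),
    (re.compile(r"corosync.*(failed|fault)", re.I), "information"),
]

# Known-harmless messages: keep them, but flag them so Kibana can
# filter them out of every graph.
WHITELIST = [
    re.compile(r"Quota exceeded"),
    re.compile(r"has more disk space than database expected"),
]


def classify(record):
    """Attach routing metadata to one log record (a dict with a 'message')."""
    message = record["message"]
    for pattern, severity in BLACKLIST:
        if pattern.search(message):
            record["item"] = "alert"
            # A standalone script could raise the alert with zabbix_sender;
            # Fluentd itself would use an output plugin instead.
            subprocess.run(
                ["zabbix_sender", "-z", "zabbix.example.com",
                 "-s", record.get("host", "unknown"),
                 "-k", "log.alert[{}]".format(severity), "-o", message],
                check=False,
            )
            return record
    for pattern in WHITELIST:
        if pattern.search(message):
            record["item"] = "ignore"   # Kibana queries exclude item:ignore
            return record
    record["item"] = "review"           # everything else gets analyzed
    return record


if __name__ == "__main__":
    sample = {"host": "compute01", "message": "Quota exceeded for instances"}
    print(classify(sample))
```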
Let me demonstrate the usage of Kibana. Six dashboards are available in Kibana; we'll show you three of them.

This is the dashboard of all logs. You can add queries to filter logs; for example, these queries filter out the ignorable logs. Let's select the toggle checkbox to enable this query. Then the number of logs in the graph is reduced. Raw logs are also available in Kibana, classified by their log levels. Expanding the critical logs panel, you will find the raw messages of the critical logs. You can also run a text search on all the logs. Let's add a query to find logs containing "create failed" and wait a moment; then the results containing the "create failed" text appear. This all-logs panel is very useful for grasping an overview.

We also prepared dashboards for further analysis. This is the dashboard of error logs. Let's take a look at one day in September. From around 18:00, errors increased and kept occurring. This graph tells us that neutron-dhcp-agent logs increased at that time, and this graph also indicates that many errors appeared in Neutron. I'll narrow down to the Neutron logs; now Neutron has proved to be in some kind of failure. The raw logs would help us analyze the cause.

This is the dashboard of OpenStack access. This graph shows the API accesses of each service, color-coded, and you can see the details. This one shows the trend of response codes, classified into normal, authentication failure, invalid request, and system error; later I'll analyze this system error. This is a list of users who failed to log in to Horizon. One user failed dozens of times, so his account may be under a takeover attempt; we'd better contact him. Now, I'll analyze the system error. Let's narrow the logs down to error responses by checking this checkbox. You can find the details of the access log. Adding a filter with the request ID, you can then see the logs related to this access, and I found an error. That's all of my demonstration.

So that is the end of our presentation. As I mentioned, we have another presentation from 4:40, and we also brought the whole system here to the venue. Today we have a demo at the NEC booth, H4, and tomorrow we have a demo at S14. So if you want to try our tools, especially the log analytics, please come and we can discuss. Also, if you have further questions or comments, you can send email to this address. Thank you very much. I think we can take some questions. Do you have any questions? No? Anyway, we are also at the exhibition today, so please come and talk to us. Thank you very much.