Ladies and gentlemen, thank you very much for waiting. We are now ready to resume the seminar program for this NEC Breakout Speaking Track. We would now like to invite Mr. Ken Igarashi, Senior Research Engineer, NTT Docomo; Mr. Jun Ishii, Research Engineer, NTT Docomo; and Mr. Takashi Torii, NEC, to talk about what an operator is doing behind the cloud. Mr. Igarashi, Mr. Ishii, please.

Good afternoon, everyone. As already mentioned, the title of the presentation is "What Operators Do Behind the Cloud." During the presentation, we want to share the tools we created for daily operation, and also the activities operators are doing behind the cloud. The background of the team was already mentioned, so we can skip that.

First, let me explain our team's rules; sometimes we call it our culture. We are a small team, so we focus on using OpenStack instead of developing OpenStack. Human resources are our key constraint: highly limited, and our most valuable resource. That is why we are always thinking about reducing operation costs and promoting automation. This is our key principle: anything that a human needs to do more than twice must be automated.

In our operation, there are DevOps engineers and there are operators. We usually call the DevOps engineers L3 engineers, and we call the operators L1 and L2. The L3 engineers develop tools to automate the operators' tasks, and they put the software and the instructions into the knowledge base.

Today we have many operation tools. For deployment, we have tools to set up the network, access accounts, and login, and to install the Zabbix agent, drivers, and firmware. We also have our own tools to install OpenStack itself. We have actively developed operation tools as well: process restart, log collection, collecting OpenStack usage for charging purposes, batch VM migration, backup, user manipulation, and health-check tools.

Let me show one example of our operation tools. This is, I think, one of the most difficult parts of OpenStack today: the demonstration is a Neutron network node update. You can see there is a node we need to maintain, but there are still an LBaaS agent and an L3 agent on this node. To maintain the node, we first need to migrate those agents to other nodes. In our case, we follow a least-loaded scheduling policy and just choose the migration host.

Let me show you the demo video. First, we have many routers on a network node called sNetwork01, and here is a load balancer agent. Here, we just list the IPs allocated to the routers on sNetwork01. Then we ping all the routers. Then it is time for the migration. As you can see, there is an Ansible playbook for the migration, so we just execute it. During the migration, you will see some packet loss. Now it is migrating; you can see ping failures being returned. But once the migration is complete, all the pings succeed. You can see now the migration is completed, and all the ping checks pass. And you can see there is no more L3 agent on the network node, and no load balancer agent either. This is before the migration, and this is after the migration: the L3 agents have moved off. You can see this load balancer was on sNetwork01, but after the migration, the same load balancer agent is on sNetwork02.
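The migration playbook shown in the demo is internal to the team, but its idea can be sketched with the standard neutron CLI. The following is a minimal, hypothetical Ansible playbook for evacuating routers from one L3 agent to another; the file name and the src_agent/dst_agent variables are illustrative, and it assumes admin OpenStack credentials are already exported in the environment on the controller host.

```yaml
# migrate_l3_agents.yml -- hypothetical sketch, not the team's actual playbook.
# Assumes the neutron CLI is installed and OS_* credentials are exported.
- hosts: controller
  gather_facts: false
  vars:
    src_agent: "ID_OF_L3_AGENT_ON_sNetwork01"   # placeholder agent IDs
    dst_agent: "ID_OF_L3_AGENT_ON_sNetwork02"
  tasks:
    - name: List the routers currently hosted on the source L3 agent
      command: neutron router-list-on-l3-agent {{ src_agent }} -f value -c id
      register: routers
      changed_when: false

    - name: Remove each router from the source agent
      command: neutron l3-agent-router-remove {{ src_agent }} {{ item }}
      with_items: "{{ routers.stdout_lines }}"

    - name: Re-add each router on the destination agent
      command: neutron l3-agent-router-add {{ dst_agent }} {{ item }}
      with_items: "{{ routers.stdout_lines }}"
```

A ping loop against the router IPs before, during, and after the run, as in the demo, confirms that only brief packet loss occurs while each router is rescheduled.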
After migrating all the agents, we can disable the node and update it. Just before bringing it back into the cluster, we have a check tool. Currently we support three types of checks: checking direct access to the VM, checking access through the load balancer, and checking access through the VPN. With this check tool we can identify whether the updated node is working correctly or not. And this is the result: you can see the updated node gets all OK messages. All the tests pass, so we can bring it back into the cluster and start using it. By using this tool, we can update even network nodes aggressively.

So far, we are mainly using Ansible for deployment and operation, with Python and shell scripts as well. For common deployment, we have already created 37 playbooks, plus 62 playbooks for OpenStack deployment, and 31 playbooks for operation.

Now we change presenters. Next, I will try to explain L1 and L2 operations. After Ken's talk, this part is slightly easier, and almost everyone here is an OpenStack operator, so it might be too easy. But please imagine, or remember, that you are a totally new OpenStack operator. Today is your first day operating OpenStack. What should you do? This is a very difficult question, but there are two things to do to operate an OpenStack system: prevent troubles, and deal with troubles.

To prevent troubles, daily operation is very important. We call daily operation "nitchoku." Nitchoku is a Japanese word: in Japanese schools, everyone takes a turn as the nitchoku, doing the routine tasks for that day. As with the school nitchoku, we rotate the role, and each operator does it in turn every day. Our nitchoku includes checking the security updates of the last day, checking the firmware updates of the last day, checking unknown alerts of the last day, checking the resource usage of each node for the last day, and analyzing the logs of the last day. These are a lot of daily tasks: without any plan, it takes almost a day, sometimes more, to finish them. So we needed to simplify this work, and we automated some of it.

The nitchoku person mainly uses three tools: the nitchoku assistant, the Zabbix graphs, and the Kibana dashboards. These tools make the nitchoku work easy, and by doing nitchoku, troubles are detected earlier than by normal alerts. Let me talk about these three tools.

First, the nitchoku assistant. It is a web crawler composed of Jenkins, Java, and a WordPress website, and it automatically reports on three topics for nitchoku at 9 a.m. every day: security updates, firmware updates of the hardware, and Zabbix alerts. It saves the time we would otherwise spend visiting each website, because it aggregates all the information. For security updates, the assistant collects the latest security advisories, and we compare them against the last nitchoku ticket to see whether there is new information. If there is new security information, we create a new ticket to decide whether to update or not. Next, we also crawl hardware information, for load balancers, UTMs, storage, and servers. Honestly speaking, the crawling is not complete, because some sites are very difficult to crawl.
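The assistant itself crawls vendor advisory sites, which is hard to show generically. As a simpler illustration of the same daily check, here is a hypothetical nitchoku-style playbook that asks each Ubuntu node directly how many security updates are pending. The file name is illustrative, and it assumes apt-based nodes with Ubuntu's apt-check helper (shipped in update-notifier-common); this is a sketch of the idea, not the team's actual assistant.

```yaml
# check_security_updates.yml -- illustrative sketch only; the real nitchoku
# assistant crawls vendor advisory sites with Jenkins instead.
- hosts: all
  gather_facts: false
  become: true
  tasks:
    - name: Count pending updates via Ubuntu's apt-check helper
      command: /usr/lib/update-notifier/apt-check
      register: apt_check
      changed_when: false

    # apt-check prints "N;M" to stderr: N pending updates, M of them security.
    - name: Report nodes that have pending security updates
      debug:
        msg: "{{ inventory_hostname }}: {{ apt_check.stderr.split(';')[1] }} security updates pending"
      when: apt_check.stderr.split(';')[1] | int > 0
```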
The report also includes hardware information and the latest updates, the same as for the software security part; but the hardware details are a bit confidential, so I have masked them here. The assistant also reports the Zabbix alerts of the last day. Some of you may think it is enough to look at the latest issues in Zabbix, but some of our alerts trigger automatic restarts, and those are no longer shown in the latest issues in Zabbix. This nitchoku tool shows that information too. It is important to know which nodes had alerts and whether they were automatically restarted or not. This information lets us prevent some troubles, so it is very important work.

We also check the Zabbix graphs, because the resource usage of each node sometimes changes drastically. We check memory, storage, and so on, and if these resources increase or decrease sharply, we create a new ticket on how to deal with it.

We also check the Kibana queries. Zabbix is good for real-time checks, but it is very hard to see trends at a glance, as on these screens. As the number of physical nodes increases, the volume of logs grows enormously, so we filter out known errors with queries and show only newly appearing logs in Kibana. The filtering queries are automatically generated from a wiki page: Jenkins kindly builds them from the wiki information. A sketch of this kind of filtered query appears at the end of this section.

These are the nitchoku tasks. We have reduced the nitchoku time by 50% per day: recently it takes only about four hours to finish the nitchoku work. Even so, it is still long, and we need to improve more. Moreover, with these tools and the knowledge base, not only the L3 people but also the L1 and L2 people can carry out these tasks.

The knowledge base is our next topic. What does it take to operate an OpenStack-based system? There are many components, and each component needs specific expertise, so we need to unify all the information in one place. When trouble happens, we only need to look at this one place and search it, which makes it easier to check all the components. We built a DevOps system around OpenStack using open source tools such as Ansible, Jenkins, Gerrit, Redmine, Fluentd, Zabbix, Kibana, and so on. Mainly, the L1 operators use the tools on the right side and check all the information in Redmine's knowledge base, while other members improve the tools through these CI/CD tools.

So what is the knowledge base? It is one function of Redmine, but it is also our database of knowledge. We concentrate all the operational information into one database, so we can easily cope with a problem that someone has already solved once. This is very important because we can avoid reinventing the wheel and reduce the time spent searching the internet. If you have trouble, you search only in the knowledge base; that is the goal for us. We have already stored over 1,200 knowledge entries. The knowledge base holds not only know-how and experience but also tool usage. Where procedures were difficult to understand, we rewrote them to be easy to follow, using Ansible and similar tools. With it, L1 and L2 operators can also deal with troubles, if they have already been solved or are easy ones. So not only the initial tasks but also the initial response to troubles is L1/L2 work in our system. This helps the other members, because they can concentrate on the important troubles, and it also reduces the time spent on user support and similar routine tasks.
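To make the Kibana part concrete: Kibana dashboards are backed by Elasticsearch, so a "show only new errors" view boils down to a bool query that excludes known noise. Below is a minimal, hypothetical sketch using Ansible's uri module; the Elasticsearch host, index pattern, field names, and ignore patterns are all assumptions, and, as described above, the real ignore list is generated by Jenkins from a wiki page.

```yaml
# new_error_logs.yml -- hypothetical sketch of the "filter known errors,
# show only new logs" idea behind the Kibana dashboards.
- hosts: localhost
  gather_facts: false
  vars:
    # In the real setup this list would come from the wiki page via Jenkins.
    ignore_patterns:
      - '"Timed out waiting for a reply"'
      - '"Connection reset by peer"'
  tasks:
    - name: Query yesterday's ERROR logs, excluding known noise
      uri:
        url: "http://elasticsearch.example:9200/logstash-*/_search"
        method: POST
        body_format: json
        return_content: yes
        body:
          size: 50
          query:
            bool:
              must:
                - match: { log_level: "ERROR" }
                - range: { "@timestamp": { gte: "now-1d/d", lt: "now/d" } }
              must_not:
                - query_string: { query: "{{ ignore_patterns | join(' OR ') }}" }
      register: es

    - name: Show the count of remaining (new) errors for the nitchoku report
      debug:
        msg: "New errors in the last day: {{ es.json.hits.total }}"
```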
Not only the trouble information but also how to use the tools and how to execute Ansible commands is recorded in the knowledge base, so all the information is stored there. Operators only need to fill out template forms, and then they can easily execute the commands. Through this knowledge base, everyone can go from non-expert to expert; of course, the knowledge base also helps you grow into more of an expert. It may be difficult to understand in the abstract, so I will show you a demo later.

This is how our troubleshooting system works when something goes wrong in the private cloud. When something goes wrong in our system, Zabbix automatically detects the state and shows a new issue, like in these figures. When an alert is raised from our cloud servers, Zabbix automatically includes a link to the knowledge base. This is very important: if an alert happens and there is a link, the L1 members can make the initial response to the alert. The operators only need to access the knowledge base and then ignore or solve the alert according to the written rules.

Moreover, almost every solution in the knowledge base has an Ansible playbook. That means the way to resolve the trouble is written in machine language, not human language, so executing the troubleshooting steps is reliable and human errors are reduced. The playbooks also guarantee idempotence: whether you execute the command once, 10 times, or 100 times, the result is the same. It is also useful because non-experts can learn how to troubleshoot by reading these playbooks; what commands you should run is all written in these YAML files. This is very important.

I will show you an example. One frequent error in OpenStack is that some system process goes down. When it is a process on a physical node, the instances on that physical node also have trouble. In the next example, I demonstrate the operation starting from the detection of the libvirt-bin system process going down. When libvirt is down, we cannot create or change instances, so it is a very critical error.

I will show you the movie. First of all, the Zabbix service screen is shown, and the status is all green; there are no issues shown. Then I stop the libvirt-bin process by hand. Check the status: it is working on the physical node SP001. So I stop it; I need to input the sudo password. Then the libvirt-bin process stops. Check the status: of course, it is stopped. Then Zabbix receives the alert and shows the new issue here. The alert is here, and we check the detailed information by clicking the last change. Now the status is a problem, and it is not resolved, so we need to deal with it.

How do we deal with it? We just click this issue's title, and it has a knowledge-base link: when the libvirt-bin process goes down, we should do these things. In this template, item number two shows whether you need to take action or not. In this case, the operators do need to act, so they follow these flows. Another knowledge-base entry is linked here; I click it and check how to restart the services. We log in to a utility server and execute the Ansible playbook there. To execute commands without typos, we recommend that operators copy and paste from the knowledge base into the command line. We need to input the hostname.
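The restart playbook in the demo is internal, but its shape is simple; here is a minimal hypothetical equivalent. The playbook and variable names (restart_service.yml, target_host, service_name) are illustrative, not the team's actual files. Using state=started keeps the play idempotent in the sense described above: it starts the service only if it is down, so running the play once, 10 times, or 100 times gives the same result.

```yaml
# restart_service.yml -- hypothetical sketch of a knowledge-base playbook.
# Idempotent: "started" only acts if the service is actually down.
- hosts: "{{ target_host }}"
  gather_facts: false
  become: true
  tasks:
    - name: Ensure the failed service is running again
      service:
        name: "{{ service_name }}"
        state: started
```

From the utility server it would be invoked roughly as `ansible-playbook restart_service.yml -e "target_host=sp001 service_name=libvirt-bin"`, and then again with `service_name=nova-compute`. For the nova-compute follow-up, where the service is still running but must reconnect to libvirt, `state: restarted` would be needed instead, trading away strict idempotence for a forced restart.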
Then, to get the hostname, we move to the Zabbix screen, copy it, paste it, and confirm. Then we enter the service name. The stopped process is libvirt-bin, so we copy it from here and paste it. Then we run the Ansible playbook, and the process is restarted. We also need to restart the nova-compute process after restarting libvirt-bin, so we execute the same playbook and change the service name. If you want to know what kind of operation the playbook performs, you can check the YAML file; this helps operators become experts. Once all the restoration is done, the Zabbix issue disappears and the detailed information changes to OK. So that's fine.

Preventing troubles and making the initial response to troubles are the most important things in operating OpenStack. We also have demonstrations at the NEC booth and the NTT Group booth, where you can try these demos. Please come, and if you want to contact us, please use the contacts here; the booth is here. Finally, operator training, by Mr. Torii.

Finally, I am talking about operator training. This slide shows the user segmentation. We think there are three categories of customers. On one side are the super users. Super users can do any task: design and planning, resource management, and operation are all done by the super users themselves. They have the knowledge, and the knowledge base, in-house, and they have specialist OpenStack experts. On the other side are users of managed service provider offerings. With these services, the user does not need to do anything: the service provider provides design and planning, resource management, and also operations. The users do not touch operations at all; they use the OpenStack environment remotely, or use it as their private cloud environment.

In the middle of these two segments are what we think are our potential users. They leave the design to integrators, but planning, resource management, and operation are done by themselves. In this category, the users have a big problem: how to operate the private cloud system. We think that a knowledge base usable by non-specialized engineers will be the solution for them.

This shows our concept of operator training. We think that operator training can be a shortcut to becoming an OpenStack operator. There are many OpenStack trainings now, but the common approach targets becoming a user, an integrator, and an operator all at once. So their courses cover what OpenStack is, the architecture, how to install it, how to use Horizon, how to use the OpenStack API, how to design, and how to operate. To become a user, an integrator, and an operator, these are all important things to know. But to become an operator only, the knowledge about the architecture, installation, and API usage is already aggregated into the knowledge base, so with this training the path to becoming an OpenStack operator is a real shortcut. This training is the first step to becoming an operator; as a second step, further training and education can lead to becoming a super user, super integrator, or super operator. We think that for becoming an operator, our program will be a shortcut.
We think that standardized operation knowledge will create an ecosystem around it. This figure shows the ecosystem around the standardized operation knowledge base. We think training is just one use case; there are more possibilities around it. Beyond training, if this kind of knowledge is created and shared in the community, it will help bring OpenStack to non-super users. And this morning, NEC's Shibata-san will talk about the super integrator. Through this training and this knowledge base, I hope that one does not have to become a super user just to be a user of OpenStack.

We are now planning to start this training for operators early next year, and we are collaborating with LPI-Japan on certification for the operator training. We also think vendors and developers can collaborate with us and contribute knowledge to the operation knowledge base, which in turn helps their products, projects, and software. Super users and integrators can also help this knowledge spread: they can input their in-house knowledge into the knowledge base and share it with the community, or with the more open community of OpenStack users. This will help bring OpenStack systems to non-super users.

So that is our final slide. Thank you for listening. Any questions? Thank you very much for your time.