Good morning everyone, thank you for coming to our talk. We're from NTT Communications, and we're here to present under the operations track on a topic titled "OpenStack Operation in a Multi-Tenant and Multi-Customer Public Cloud Environment." Just a quick introduction about ourselves. I'm Yuta Hono, mainly working on the design and development of our operation and support tools. I'm Koji Hota, working on the server services of our enterprise public cloud, especially Nova. And I'm Charles Siloy, and I'm mainly working on infrastructure management. This is the outline of our presentation: we'll have an introduction, then we'll present the requirements of our service, then the challenges that we faced and how we solved them, then an actual use case and demo, and finally we'll wrap up. Regarding the introduction, I'd like to emphasize that we will be explaining the business background of our service, because it's important for understanding our requirements. So, the introduction — just a quick rundown on where we come from. We come from the NTT Group, and NTT is actually the company that won the Superuser Award at the last summit in Tokyo. Under NTT, we come from NTT Communications. Some numbers: the NTT Group is one of the world's largest ICT companies. Of the Fortune Global 100, 80% actually choose our services, and we have over $105 billion in total revenue. NTT Communications is headquartered in Tokyo, Japan, and NTT Com is one of the leading cloud providers in Japan and one of the biggest data center operators in the world. Our services include data centers in over 140 countries and regions, VPN in over 196 countries, a global Tier 1 Internet backbone, worldwide submarine cables, and other services worldwide. So, in which of our services do we actually use OpenStack?
We use OpenStack in our service called NTT Communications Enterprise Cloud. It's mainly an IaaS managed cloud, and the range of services we provide includes servers, storage, solution packages (which are mainly deployment services), network, security, backup, and app service and management. This is important in understanding the requirements of our service. In the global market, NTT Communications Enterprise Cloud is available in 14 regions; on the map you can see the different regions where our service is available, and we have one more planned. Because we have so many regions, we have global affiliates like NTT America here in the US, NTT Europe, NTT Singapore, NTT Com Asia, and NTT Com ICT. Across this wide footprint, we have multiple support teams around the world, catering to multiple customers in multiple languages. So why do we use OpenStack? Firstly, because it's open source, and because it is open source, it has a continuously expanding and active developer community — that's why we're all here in Austin attending this OpenStack Summit. The version we use is Juno, the tenth release. This diagram is a summary of how we use OpenStack in our service. On the left side is the category of our service, and on the right side you can see the legend. The dark blue components are the OpenStack components that we use: Horizon, Keystone, Nova, Glance, and Cinder. The light blue ones, on the other hand, are original components that we have developed ourselves, which are compatible or partially compatible with the OpenStack APIs. Now, why not just use the original components of OpenStack — why did we develop original components? Because some of the original OpenStack components do not meet the requirements of our service.
So we had to develop them ourselves. These include Nova-compatible bare metal servers, provisioned-I/O block storage that is Cinder compatible, and a Neutron-compatible network service. On the other hand, we also have original components that are not compatible with the OpenStack APIs but that we need for our service, including our solution packages, with deployment services for SAP HANA and dedicated Hyper-V. Also regarding the network, we have firewall, load balancer, VPN connectivity, and so on. We also have user contract management with partner APIs; we provide this alongside Keystone, and we provide it for our customers and sales partners as well. And we also use other open source software, namely Cloud Foundry. So this is the important business background of our service: our main users are enterprise users — that's why it's an enterprise public cloud environment. However, there are gaps between what the OpenStack community version provides and what those users actually want, and because of that, we had to establish the requirements of our service. We have two main requirements. The first is to support both traditional IT and cloud-native IT, and the second is to support a multi-customer and multi-tenant environment. Here is Koji to explain the first one. From now on, let me explain supporting traditional IT and cloud-native IT. Traditional IT and cloud-native IT can be explained through the "pets versus cattle" analogy. First, let's think about the pet model. When you have a pet, such as a cute dog — actually, he's my pet — you give a special name to your pet, like Yuta. When the pet gets sick, you nurse him back to health at your house. You love him so much, and he is not replaceable. On the other hand, cattle represent cloud-native IT, which is designed for the cloud environment. So if one VM goes down... oh, just a minute.
So these guys represent cloud-native IT. When you have cattle, like in this image, it is difficult to distinguish them by appearance, so you don't really give them names, and you don't give special attention to any one of them — each one is replaceable. So let's get back to traditional IT and cloud-native IT. First, the pet model: the pet model is like traditional IT. Some traditional IT users are struggling to migrate their apps to the cloud; their current hardware is aging and would require replacement. In order to reduce the cost of replacing hardware, they would like to move to the cloud first, and only after migrating change their architecture for the cloud environment. However, if a VM running a traditional IT app goes down, the end users of that service are affected, and the service impact is great. This means that traditional IT users want the IaaS provider to recover the VM immediately. So in order to support traditional IT — the pet model — the traditional IT VMs need HA. On the other hand, the cattle model is cloud-native IT: the apps are designed for a cloud architecture. This means that if one VM goes down, it will not be noticeable to the end user, so there's almost no service impact. And we need to support both — both traditional IT and cloud-native IT. To support the pet model, we needed to implement virtual machine high availability, which we call VMHA. VMHA works as you can see in the diagram below: the VMs on a compute node are evacuated to another host in case of any failure. There are two perspectives on why VMHA is required in an enterprise public cloud, as I mentioned on the previous slide.
As a traditional IT user, you would like the IaaS provider to minimize the impact on your traditional IT workloads. The other perspective is that of the public IaaS provider: we need to keep the public IaaS working even when incidents and outages occur. So we needed to implement VMHA in OpenStack. But, you know, there is no VMHA function in the OpenStack Nova community. The reason VMHA is not in Nova is that implementing VMHA doesn't match Nova's concept — the idea is that applications themselves should be changed to a cloud-native architecture. As a simple solution, we could implement a VMHA function inside Nova, but then maintenance and operational costs would increase, and it could create a big barrier for OpenStack version upgrades. So we'd like to avoid customizing Nova itself. This is our solution: we use Masakari to realize VMHA in OpenStack. Masakari is open source, originally developed by NTT, and if you want to use it, you can download it at the URL here. We deploy Masakari as an extra component, outside of OpenStack, which means we don't need to modify the OpenStack source code itself. Masakari can also meet our service requirements for both the pet model and the cattle model. For the pet model, Masakari can rescue the VM and bring it back to work automatically; for the cattle model, the customer can choose not to use the VMHA function. This function is provided by Masakari. Here is a simple architecture of Masakari. The key technical elements are the controller and the agent: the Masakari controller runs on the Masakari node, and a Masakari agent runs on each compute node.
Pacemaker and Corosync are also required on each compute node. Masakari can rescue VMs affected by two kinds of incidents: a host-down failure and a single-VM-down failure. Let's see an example of each case. The first is host down: compute node 2 goes down. Another host detects that compute node 2 is down via Pacemaker and Corosync. The Masakari agent then notifies the Masakari controller that compute node 2 is down, and the Masakari controller executes a recovery request to Nova via the API. VM1 and VM3 will then be evacuated to compute node 1. As you can see, on compute node 2 there is also VM2, which has its VMHA flag disabled — its owner is a cloud-native IT user who deployed the VM on this node — so that VM will not be evacuated to another node. That is the host-down case. The other case is the single-VM-down failure. On compute node 3, a VM goes down unintentionally. The Masakari agent detects the VM going down and notifies the Masakari controller of the single VM down, and then a recovery request is executed and the VM is restarted on the same node. So by using Masakari, we can support both traditional IT and cloud-native IT. Now, let's go to the next topic — the next requirement, the multi-customer, multi-tenant environment. OK, so since we provide OpenStack as an IaaS service for customers as a public cloud, we have to support multiple customers and multiple tenants in the same box. Here is an example. As you can see, here is one compute node, and on that one compute node there are resources A, B, and C, which are owned by customers A, B, and C.
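The two recovery cases described above — host down with the per-VM HA flag, and single VM down — could be sketched like this. This is purely an illustrative sketch of the decision logic from the talk, not Masakari's actual source code; the function and field names are made up for illustration.

```python
# Hypothetical sketch of the VMHA recovery decisions described in the talk.
# On a host-down event, VMs with the VMHA flag enabled are evacuated to
# another host; VMs whose owners opted out (cattle-style workloads) are
# skipped. On a single-VM-down event, the VM is restarted on the same host.

def choose_recovery_actions(event_type, vms):
    """Return (vm_id, action) pairs for a failure event.

    vms: list of dicts like {"id": "VM1", "ha_enabled": True}
    event_type: "host_down" or "vm_down"
    """
    actions = []
    for vm in vms:
        if event_type == "host_down":
            if vm["ha_enabled"]:
                actions.append((vm["id"], "evacuate"))  # move to another host
            else:
                actions.append((vm["id"], "skip"))      # owner opted out of VMHA
        elif event_type == "vm_down":
            actions.append((vm["id"], "restart"))       # restart on the same host
    return actions

# The example from the talk: compute node 2 goes down hosting VM1, VM2, VM3,
# where VM2 belongs to a cloud-native user who disabled VMHA.
vms = [
    {"id": "VM1", "ha_enabled": True},
    {"id": "VM2", "ha_enabled": False},
    {"id": "VM3", "ha_enabled": True},
]
print(choose_recovery_actions("host_down", vms))
```

In the real system this decision is driven by the Masakari controller reacting to notifications from the per-node agents, with Nova's evacuate API doing the actual move.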
In the pet model, we need to let customers know if something happens to their resources — like, "hey, your resource B went down." To support this model, we need a way to find out the relationship between a resource and the customer information: resource A is owned by customer A, and customer A's mail address is such-and-such. So that's the idea, and it's a very simple use case. Here is an actual environment, which is more complicated. On the left side, there are multiple customers shown in different colors, and those customers are supported by different support teams. There are reasons why we have multiple support teams. One is language differences: we have support teams for English, for Japanese, and sometimes for Spanish, so to cover this kind of use case we divide the support teams. Another is that some regions, like the EU, have regulations — like the personal data protection regulation. That means only the UK support team can access those customers' personal data; the US and JP support teams shouldn't be able to access their data. So this is a requirement. Also, on one compute node there are sometimes multiple tenants owned by a single customer. On the left side, as you can see, there is a firewall instance. As Charles mentioned, we provide firewall as a service, which is built as a managed instance on a normal compute node. In this case, the firewall instance is built on the same compute node as the customer's resources, but it is owned by the NTT Com tenant itself. That means we cannot track the customer based on the tenant information alone — it's not enough. So this is another requirement. And here are the challenges. The first challenge is that one incident can affect multiple resources.
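The resource-to-customer-to-support-team mapping described above could be modeled minimally as two lookup tables. This is a hedged sketch of the concept only; all names, addresses, and team codes below are made up for illustration.

```python
# Illustrative mapping from resource to owning customer, and from customer
# to contact details and responsible support team (each team with its own
# language and data-access rules, as described in the talk).

RESOURCE_OWNER = {"resource-A": "customer-A", "resource-B": "customer-B"}
CUSTOMER_INFO = {
    "customer-A": {"email": "a@example.com", "support_team": "JP"},
    "customer-B": {"email": "b@example.com", "support_team": "EU"},
}

def contact_for_incident(resource_id):
    """Resolve an affected resource to its customer, contact, and team."""
    customer = RESOURCE_OWNER[resource_id]
    info = CUSTOMER_INFO[customer]
    return customer, info["email"], info["support_team"]

print(contact_for_incident("resource-B"))
```

The point of the talk is that this mapping is not directly available from tenant information alone (the firewall instance owned by the provider's own tenant being the counter-example), which is why a separate collection mechanism is needed.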
As I mentioned, in the pet model, notifying the customer when an incident occurs is really important. So if something happens to a Neutron network, it has relationships with Nova instances, right? The network connectivity of those Nova instances is affected by the Neutron network incident, and so are other services, like bare metal servers and NFS block storage. We need to let customers know about that — not only the network name, but also the related resource information. And it's much more complicated in a multi-customer, multi-tenant environment, as I mentioned. Here are the other challenges. Yes, of course we have Masakari, as Koji mentioned, but VMHA itself sometimes makes it difficult to track customer information. Here is an example. The virtual resources are recovered by VMHA, but it actually takes time — something like five minutes — and customers are affected during that window, so we need to let them know about it all the same. And if something happens to compute node 2, on the right side, Masakari immediately tries to move the VMs on compute node 2 to another host, in this case compute node 1. But after they are evacuated, we don't have a way to know which resources were on compute node 2 — which means we don't have a way to know which VMs were affected by the incident. Also, as you may know, the evacuation itself can sometimes fail, and in that case we need a manual recovery. Of course, we could connect to the databases of Nova, Cinder, and Neutron and try to build that relation map from the databases directly, but that takes time, and lower-tier engineers sometimes cannot do it. And the Masakari log only records the trigger for VMHA, so we sometimes lose track of the VMs' locations.
So that was a challenge for us. One more thing: the OpenStack DB shows only the latest status of the virtual resources. Here is a use case: sometimes a customer asks us, "hey, my resources were affected by something one day ago." But when we get that kind of inquiry, we don't have a way to know the status of those virtual instances as of one day ago. That can be a problem. OK, so let's move on to the solution. We have implemented two types of solutions in one application, which we call the operation portal in our team. It has two functions to support support engineers and operation engineers. One is resource state and location history collection across multiple services. Here is a very simple description: we get the customers' resource information from the DBs, or sometimes the APIs, of multiple OpenStack services every five minutes, and collect that information into the resource history DB. We provide the functionality to view it from the operation portal, which is for operation and support engineers, so operation and support engineers can check the historical record of each VM's history and location. That is one thing. The other is incident ticket association with resource information. We handle incidents with tickets in order to manage the tasks, and we implemented functionality to associate the affected resources with the incident tickets — I will show that later. So this is a very general overview of our solution. Let me jump into solution one, resource state and location history collection. We have a resource collector in each region that collects customer resource information from the OpenStack services, via the APIs or sometimes the DBs. We gather that information into one location in each region, and we provide the functionality to check it from the GUI. And here are the resources collected by the resource collector — this is just an example.
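The five-minute polling collector just described could be sketched as below. This is a minimal sketch under stated assumptions: `fetch_resources` is a hypothetical stand-in for the real Nova/Cinder/Neutron API or DB queries, and the history "database" is just an in-memory list here.

```python
# Sketch of a polling resource collector: every five minutes, fetch
# resource state from each OpenStack service and append a timestamped
# snapshot to a history store, so operators can later answer questions
# like "where was this VM one day ago?"

import time
from datetime import datetime, timezone

POLL_INTERVAL = 300  # seconds, i.e. five minutes as in the talk
history_db = []      # stand-in for the resource history database

def fetch_resources():
    """Hypothetical placeholder for querying OpenStack service APIs/DBs."""
    return [{"id": "VM1", "host": "compute-1", "status": "ACTIVE"}]

def collect_once(now=None):
    """Take one snapshot and append timestamped records to the history."""
    now = now or datetime.now(timezone.utc)
    for res in fetch_resources():
        history_db.append({"collected_at": now.isoformat(), **res})

def run_collector():
    """Polling loop; in production this would be a scheduled job."""
    while True:
        collect_once()
        time.sleep(POLL_INTERVAL)

collect_once()
print(history_db[-1]["id"])
```

Keeping every snapshot, rather than overwriting the latest state, is what lets this design answer historical inquiries that the OpenStack DB itself cannot.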
We collect the information from Nova — the instances table, and also instance_metadata, aggregate_metadata, and aggregate_hosts. To know a VM's current location, collecting the information from these tables helps us a lot. We also gather information from Cinder and Neutron, and of course some other services, into one location. And here is the overview of the second solution. All the information is associated with incident-handling tickets. As I mentioned, we handle incidents using a ticket system from a task-management perspective, and we can attach the affected resources from the resource history database to the incident tickets. Operators keep updating the incident tickets based on the latest incident-handling information. Here is the incident ticket overview: we have fields for the dates, the effects, and the affected customer resources. Based on these tickets, when our support staff and account managers get an inquiry from a customer, they can handle it by themselves based on the latest operational information. Beyond that, we have functionality to send a bulk email notification to customers. That means we need to fill in per-customer parameters, like "hey, your tenant A went down." If we did that manually, it would take a lot of time, so we automated it with a parameter function for the email notifications. So here is the actual use case and the demo. Let me jump into the demo. This is a very simple use case. First of all, suppose there is a host-down failure on a Nova compute node. We get an alarm from the monitoring team, and the operator checks which hypervisor went down and which VMs are affected. Basically, the VMs are restarted automatically by VMHA — Masakari — so we don't need to do anything for that. Then we send an incident notification, and later a recovery notification. OK, let me jump into the demo.
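As a hedged illustration of the kind of query the collector runs against Nova's database, here is a sketch using SQLite as a stand-in for the real MySQL connection. The table and column names follow the tables mentioned in the talk (`instances` with a `host` column), but the exact Nova schema varies by release, so treat this as an assumption.

```python
# Sketch: finding which VMs are placed on a given hypervisor by querying
# an instances table like Nova's. SQLite stands in for the real database.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instances (uuid TEXT, host TEXT, vm_state TEXT)")
conn.execute(
    "INSERT INTO instances VALUES ('vm-uuid-1', 'compute-1', 'active')")

def vms_on_host(host):
    """List instances currently placed on the given hypervisor."""
    rows = conn.execute(
        "SELECT uuid, vm_state FROM instances WHERE host = ?", (host,))
    return rows.fetchall()

print(vms_on_host("compute-1"))
```

Snapshotting the result of queries like this every few minutes is what preserves the pre-evacuation placement that would otherwise be lost once Masakari moves the VMs.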
Okay, so here is our actual operation portal page. Here are the alarms coming from our monitoring system. We have functionality for bulk handling like this, and we can acknowledge or terminate these alerts to signal to the monitoring team that they are already being handled by our operation team. One fun thing is that we have a functionality to put these alerts into a cart, like on an e-commerce site, like Amazon. We can add them to the cart and then associate those alerts with incident tickets. So let me go to the incident tickets page. Here are the incident tickets: we can check every incident happening in our environment and control it from here, and we can check the incident detail from here. So here is the actual incident ticket page, and here is the incident description. Of course, this is just a demonstration — it's dummy data — but you can confirm when it occurred, when it was recovered, the status, the customer impact, the region and which services are affected by this incident, and also some notes. Usually our operator updates this note with the latest information, and our support staff and account managers can confirm the latest status of the incident handling from here. Here are the associated alarms coming from our monitoring, so the monitoring team can know which alarms are already handled and which are not. We can put the affected resources here, and then we can send a bulk notification from here. This "group" actually means our sales channel: we can send an email per sales channel, in a different language or to a different team, and we have access restrictions for that. So let me go to the resource collection database. Here is the page of the resource collection database. Here we have a search field for the data, and here is the data itself.
We can search by region channel — we have channels like America or JP — and here are the contract ID and the tenant. And this is the date and time: if you put in a date in the past, you can see the state as of then. This helps a lot in knowing the history of a VM's status. Also, here are the filters for resources; we usually use the host name filter, so if something happens to a Nova compute node, we can search for the affected resources from here. And here is the instance list, and here the Cinder volume list. We can check those, put them into the cart as well, and do some shopping, so to speak. OK, so let me go back to the presentation. Finally, after creating the incident ticket, we can send an email to the customer. Based on the incident ticket information, the customer's notification is automatically filled out. Here is an example: if we put the tenant.tenant_id parameter, the tenant ID for each customer is shown here from the customer's perspective. And we can use if statements, like in a Velocity template — meaning that if the customer has a VM, this virtual server part is shown in that customer's email, and if not, it's not shown. So it's very useful. And of course, we still have a lot of challenges going forward. One is operations automation and hand-over to lower-tier engineers. We plan to expand this portal with functionality to run some operations, like live migration and evacuation, on multiple customers' resources. Yes, of course we can do that from the CLI, but to hand those tasks over to more lower-tier engineers, we should have that functionality in the GUI, so we will implement it. Also, automatic incident ticket creation and customer notification with predefined patterns for known-pattern incidents. Since we expanded our services to the US only recently, we don't yet have many known patterns there.
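The parameterized, conditional notification templating just described could be sketched as follows. The real system uses a Velocity-style template; this is a tiny Python stand-in (using `string.Template`) that substitutes a tenant ID and includes the virtual-server section only for customers that actually have affected VMs. The template text and names are illustrative, not the production wording.

```python
# Sketch of per-customer notification rendering: substitute per-customer
# parameters, and conditionally include sections the way a Velocity-style
# #if block would.

from string import Template

BODY = Template("Dear customer,\nYour tenant $tenant_id was affected.\n")
VM_SECTION = Template("Affected virtual servers: $vm_list\n")

def render_notification(tenant_id, vms):
    text = BODY.substitute(tenant_id=tenant_id)
    if vms:  # the conditional block: only shown for customers with VMs
        text += VM_SECTION.substitute(vm_list=", ".join(vms))
    return text

print(render_notification("tenant-A", ["VM1", "VM2"]))
print(render_notification("tenant-B", []))
```

Driving this from the affected-resource list attached to the incident ticket is what turns a manual, error-prone bulk mailing into one automated step.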
Actually, the US data center launched the day before yesterday. So we will implement and automate that. We will also provide the functionality to check the incident-handling status together with sales partners, because sometimes a sales partner gets an inquiry from their customer, and they would like to know the latest status of the incident, including the internal information. So this is the plan. OK, so let me wrap up this session. We introduced our use case of OpenStack operation in a multi-tenant and multi-customer public cloud environment. Hopefully this use case and this presentation are useful for you as well — if you operate this kind of complicated environment, or if you want more reliability for the virtual machines themselves. We achieved quick notification to each customer and recovery of VMs affected by incidents, with resource history collection and also VMHA, which is Masakari. And we will keep contributing to the OpenStack community — with feedback and by sending patches to the community, and also with this kind of knowledge sharing at the summit. Right, that's all from us. Thanks for coming to this session, and if you have any questions, please let us know. Question from the audience: in Masakari, you're using Pacemaker, with heartbeats in a full-mesh architecture — are there any scalability issues with that? Sorry, say again? Scalability issues with Masakari, because of the full-mesh heartbeats you're doing down there between the hypervisors? At this point, we only just released our service this March, so we haven't seen any specific incidents caused by Masakari so far. But yeah, at this point, no. OK, thank you. OK, I think we can close this session. Thanks for coming.