So I'm Masahito, nice to meet you. I'm really happy to be here, because it has been three or four years since my last offline presentation, so I'm really excited to present offline again. Today we will present LINE's journey on the road to four million cores in the private cloud. First, let me introduce myself. I'm Masahito, a software engineer and manager, and I have been working for LINE for over three years now. Hi, this is Mitsuhiro Tanino, nice to meet you. I'm also a senior software engineer, and I have also been working for LINE for over three years. Thank you. In my part, I will explain the overview of Veruda: what the Veruda private cloud is and what we provide to LINE application developers. I will also explain the history of the Veruda private cloud. In the second half, Mitsuhiro will explain one of our technical challenges in detail.

First, Veruda is the private cloud platform for LINE. Veruda provides the IaaS layer, such as virtual machines, NAT, load balancers, and DNS as a service, and also PaaS-layer services such as an app engine, containers, managed Elasticsearch, and so on. We provide these services to LINE application developers.

This is the scale of Veruda and the LINE infrastructure. In LINE, we have over 70,000 physical servers in total, and our peak user traffic is over 3 Tbps. Within this scale, Veruda manages over 46,000 bare metal servers and over 7,600 hypervisors, and over 100,000 virtual machines are running on these hypervisors.

This is the service stack that Veruda provides. As you can see at the bottom layer, we provide IaaS functionality to LINE application developers: identity, virtual machines, bare metal, network, and storage. On top of this IaaS layer, we also provide managed services such as managed Kubernetes, managed MySQL, managed Redis, and so on. In addition to the managed services, we provide a function service and a CI/CD pipeline built on open source software.

OK, so that was a quick overview of Veruda itself. From now on, I want to talk about the quick history of Veruda. Roughly speaking, we can divide the history of Veruda into three phases: the startup phase, the expansion phase, and the new infra phase. In each phase we had different types of challenges and solved lots of different problems, so I want to explain some specific challenges and problems we solved in each period.

First, the startup period. At that time, from 2016 to 2019, we, meaning LINE infra, had one big problem: infra provisioning took too long. For example, providing even one virtual machine took almost two weeks. This is the communication flow before Veruda. First, an application developer in LINE had to submit an infra request workflow to the infra teams. The infra teams then did some consultation to understand the request details. Once the infra team understood the request, they started to set up the infrastructure: virtual machines, storage, and so on. After that, the infra teams handed the configured infrastructure over to the app teams, and the app teams could start to set up their applications. So this entire workflow took a long time.
To solve that, Veruda opened a private cloud with a minimum API set to LINE application developers. After Veruda, application developers and infra teams could work independently. On the application developer side, they first create a resource from the API or GUI, then Veruda provisions the infrastructure resource automatically, and after that they can start setting up their application. On the other hand, on the infra team side, what they need to do is bulk resource management from the hypervisor point of view, and administration of the Veruda cloud from the API and GUI.

At that time, there were some OpenStack technical challenges. First, we had to open a common OpenStack API to developers as soon as possible, because all infra provisioning took too long, so providing the API quickly was important. That's why, at that moment, we focused on opening the API set to developers. To do that, we minimized the API set from OpenStack that we exposed to LINE application developers, and we also developed lots of LINE-original APIs and components, such as the VMeter API and API filters.

After that, we had one culture change. After we opened Veruda, Veruda changed the character of infrastructure resources from facilities to on-demand, API-manageable resources. From the app teams' point of view, less communication with the infra teams is required. On the infra team side, they can do bulk resource management instead of detailed, request-specific infra management. These were our challenges and the culture change in the first period.

OK, then let's move on to the second period. In the expansion period, there was a problem again: LINE developers had to install a common middleware set by themselves. As I explained, app developers were able to provision their virtual machines. However, to develop a LINE service, you of course also need some middleware, like databases, managed Kubernetes, and so on. By changing the infra management culture, we reduced the infra preparation time from two weeks to 10 minutes, but for middleware preparation there was no change. As you can see, even though app developers could prepare their infrastructure from the API, after that they had to communicate with the DB administration teams to set up their databases, and the database teams also needed to understand the middleware requirements and handle the monitoring and administration tasks.

To solve this problem, we already knew the solution: Veruda opened managed middleware service APIs, just like we did for the virtual machine and storage IaaS layer. We applied the same approach to Veruda. First, Veruda opened the managed middleware service APIs. From the app developer point of view, they can create database resources, Kubernetes resources, and other middleware resources from the API and the UI, and Veruda automatically sets up the database resources as well as any infra resources used underneath. From the DB team perspective, they only handle monitoring and DB administration tasks independently. OK, these are the problems and what Veruda solved in the expansion period.
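To give a concrete feel for this API-driven self-service, here is a minimal sketch of what creating an instance looks like through the standard OpenStack SDK. This is an illustration rather than Veruda's actual code; the cloud name, image, flavor, and network names are hypothetical placeholders.

```python
# Minimal sketch: self-service instance creation through the OpenStack SDK.
# The cloud, image, flavor, and network names below are placeholders.
import openstack

# Credentials and region are resolved from clouds.yaml or OS_* environment variables.
conn = openstack.connect(cloud="my-private-cloud")

image = conn.compute.find_image("ubuntu-22.04")
flavor = conn.compute.find_flavor("m1.medium")
network = conn.network.find_network("service-network")

# Ask the cloud for a VM; the platform handles the actual provisioning.
server = conn.compute.create_server(
    name="app-server-01",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": network.id}],
)

# Block until the instance reaches ACTIVE, then print its addresses.
server = conn.compute.wait_for_server(server)
print(server.name, server.status, server.addresses)
```

The point is that, from the developer's side, the whole interaction is a single API call instead of a two-week workflow with the infra teams.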
And at that time, we had another type of OpenStack technical challenge. First, opening managed services like managed MySQL and managed Kubernetes triggered rapid growth of the OpenStack scale. At the beginning of 2019, we had almost 1,400 hypervisors across all regions; by the end of 2021, the total was over 6,000 hypervisors. This rapid growth required changes to the OpenStack deployment topology and to the tools we use to manage such large clusters. The second problem we tried to solve was that some OpenStack API plugins were not mature enough for large-scale clusters. For example, Kubernetes has a CSI plugin to manage persistent volumes, and the CSI plugin provides Cinder support. However, the Cinder support makes lots of OpenStack API calls, even while tasks are still incomplete or in progress. The same situation happened with the Ansible Keystone user management plugin: it does not handle things like the more than 20,000 users inside our Keystone database. We solved these problems one by one during the expansion period. And we made one culture change for LINE developers at that time: LINE developers can now focus on developing their service applications. They no longer need to set up their middleware, infrastructure, and everything else by themselves.

OK, then the last phase, the new infra period, which is still ongoing. We have one problem right now: infra management skill really relies on each development team's knowledge of Veruda. As I explained for the startup period, we developed our own custom APIs, like the bare metal API, and customized Nova and Kubernetes as well. Because of that, standard infra management tools, like Ansible and other sophisticated tools, can't use some of Veruda's APIs with their default configurations. So teams that have Veruda knowledge can use these sophisticated infra management tools with Veruda, while other teams that don't have that knowledge fall back to traditional manual operations, even though we provide the API and GUI. As the figure shows, app developers use resource provisioning tools like Ansible and Cluster API. However, because we provide the bare metal API as our own custom API, these resource provisioning tools cannot use it; the teams either have to develop plugins for it or do manual operations.

To solve that, Veruda straightened the API stack to follow the default, standard API set. Right now, we develop new features to support our infrastructure on top of OpenStack, because OpenStack is one of the de facto standard API sets for Kubernetes and for many other tools. That's why all application developers can now use the default API set to manage LINE's infrastructure. One of our big challenges right now is to revisit the OpenStack API philosophy. One part of that philosophy is to provide a unified API to manage different types of backend resources, such as virtual machines, bare metal servers, and any other type of computing resource. To realize that, we renovated some API implementations to follow this philosophy. By doing this, we made one culture change in the LINE application development teams: we standardized the tool side, so they can now use common infra management tools to manage their applications.
OK, I quickly went through all three phases, so let me summarize what Veruda realized and the OpenStack challenges in each period. In the first, startup period, Veruda changed the infra communication style from conversations to using the API and GUI, and from the technical point of view, of course, we opened the IaaS API to application developers. In the expansion period, Veruda changed the middleware management style. This is almost the same as what we did for the infra communication style, but applied to the middleware management layer. On the OpenStack side, we supported 500% growth in three years, which caused a lot of technical challenges. And last, in the new infra period, Veruda reduced the knowledge gap between development teams, and on the OpenStack team side we straightened the API stack so that it can be used by any tool.

OK, this is the last slide of my part. We have some lessons learned from the last seven years of this journey. First, culture change made a drastic improvement. For example, even if we provide an API or GUI as an interface layer, if the infra teams still communicate with the app development teams directly, it doesn't reduce the communication cost. Second, the technical bottleneck depends on the infra scale. This is a slide reviewing our history, so it's easy now to point at the challenges we dealt with recently, but that is only in hindsight. In the beginning or in the middle phase, we could not see that there would be different bottlenecks later; each phase had different bottlenecks, so in each phase we should focus on that phase's problems. And the last one: the open source ecosystem has strong power. Right now, we provide the standard API set to application developers, so they can use everything and learn how to use it not only from Veruda's internal documentation but also from the internet. That helps application developers a lot from the knowledge perspective. From now on, I want to pass the mic to Tanino-san to talk about one of the technical details of our services.

OK, thank you, Masahito. From now on, let me explain one of our technical challenges: the improvement of the bare metal server management system in Veruda. Let me explain the background of this improvement. As we explained, Veruda provides two types of resources: one is the virtual machine and the other is the bare metal server. For the virtual machines, we use OpenStack-based IaaS management, while for the bare metal servers we previously used an in-house server management system. So we provided the OpenStack API for virtual machines but a different API for bare metal servers, and this was complicated for Veruda application developers. From the developer's perspective, they needed to understand two completely different types of API to automate virtual machine management and bare metal server operations. Also, from the Veruda operators' perspective, we always needed to develop the same functionality twice, once for the virtual machine management system and once for the bare metal management system. This also increased our management, development, and operation costs on the bare metal operator side.
Therefore, in 2020, we started a new project to improve the bare metal server management system. To improve it, there were several requirements, which came both from the application developers and from the Veruda operator side. For the application developers, Veruda needed to provide a unified API for multiple resource types, so that application developers can automate both virtual machine resources and bare metal resources using the unified OpenStack API. We also needed to provide the same level of functionality for virtual machines and bare metal servers, so that end users can use the same set of features for both. Another requirement for bare metal servers is that application developers want a private stock that is pre-assigned to their project, meaning dedicated bare metal servers for that application developer team. The requirements from the Veruda operator side, that is, from our side, were to reduce development, maintenance, and management cost as much as possible, for the virtual machine system as well as for the bare metal server management system. Another key point is that we already have a strong hardware-layer management system that manages the IPMI layer and the OS installation layer, and that is already deployed across multiple data centers in multiple regions. So one requirement was to reuse this hardware management layer in our new, improved bare metal server management system.

From these requirements, this list shows what we completed. First, in order to manage bare metal servers with OpenStack, we developed a Nova compute driver for bare metal server management. As you know, OpenStack itself has the OpenStack Ironic project, which can manage bare metal servers. However, given our requirements, we already have the IPMI management system and the OS installation layer, so we didn't use OpenStack Ironic and instead developed a Nova compute driver to support bare metal server management. Second, we needed to provide a server stock management mechanism for end users, so we implemented a stock management mechanism in Nova. Third, to deploy bare metal servers into highly available environments, application developers require, for HA purposes, the ability to distribute bare metal servers across multiple regions, multiple availability zones, multiple racks, and so on. We support this feature, which we call HA groups. And finally, in order to deploy the control plane of the bare metal management mechanism, we prepared a CI/CD pipeline using Argo CD. Let me explain these items one by one in the next sections.

First, let me explain what the bare metal driver is, the architecture around it, and then dive deeper into the features we provide for bare metal management. So, what is the bare metal compute driver? As I explained, it is an OpenStack Nova compute driver developed by LINE, and the driver communicates with the physical server management systems, such as the IPMI management system and the OS installation system, to build up a bare metal server.
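As a rough sketch of what such a driver can look like, the outline below follows Nova's ComputeDriver interface. The method signatures are simplified (they differ between Nova releases), and IpmiManagementClient and OsInstallationClient are hypothetical stand-ins for LINE's in-house hardware management and OS installation systems, which are not public.

```python
# Sketch only: a Nova compute driver that provisions physical servers via
# existing hardware management and OS installation systems. Signatures are
# simplified; the client classes below are hypothetical placeholders.
from nova.virt import driver


class IpmiManagementClient:
    """Hypothetical client for the existing IPMI/hardware management layer."""
    def pxe_boot(self, node): ...
    def power_off(self, node): ...


class OsInstallationClient:
    """Hypothetical client for the existing OS installation system."""
    def create_task(self, node, image_meta): ...
    def wait_for_completion(self, task): ...


class BareMetalDriver(driver.ComputeDriver):
    """Maps Nova instances onto pre-racked physical servers."""

    def __init__(self, virtapi):
        super().__init__(virtapi)
        self.ipmi = IpmiManagementClient()
        self.os_install = OsInstallationClient()

    def spawn(self, context, instance, image_meta, injected_files,
              admin_password, allocations, network_info=None,
              block_device_info=None, **kwargs):
        # The scheduler has already picked this compute service; map the Nova
        # instance onto one of the physical nodes it is responsible for.
        node = self._select_node(instance)
        # 1. Ask the hardware layer to PXE boot the physical server.
        self.ipmi.pxe_boot(node)
        # 2. Create an OS installation task and wait until it finishes
        #    (roughly 10-15 minutes in practice).
        task = self.os_install.create_task(node, image_meta)
        self.os_install.wait_for_completion(task)
        # 3. Returning without raising tells Nova the instance build succeeded.

    def destroy(self, context, instance, network_info,
                block_device_info=None, destroy_disks=True, **kwargs):
        # Power the machine off and return it to the stock pool.
        self.ipmi.power_off(self._select_node(instance))

    def _select_node(self, instance):
        # Placeholder for the real node-mapping logic.
        raise NotImplementedError
```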
This driver supports the basic OpenStack functionalities such as create instance, delete instance, rebuild, start, stop, and so on. And by using the new bare metal server management system, users can create a new bare metal server from their pre-assigned stock. In the figure on the right, you can see that in the OpenStack Nova layer we implemented the bare metal driver in the same way as the existing virt drivers, so from the application developer's perspective, they can use both virtual machines and bare metal servers through the unified OpenStack API.

This figure shows a more detailed deployment architecture of the bare metal compute driver. In the center of the figure is the Veruda Kubernetes service. In the Veruda team, we deploy the Nova services on top of a Kubernetes environment; for example, nova-api, nova-scheduler, and nova-conductor all run on Kubernetes as StatefulSet or Deployment resources. The red box is the newly developed nova-compute for bare metal: we also deploy the nova-compute service that manages the physical bare metal servers on the Kubernetes environment as a Deployment. The bottom box shows the bare metal servers managed by this nova-compute driver, and the right box shows the existing physical server management systems, such as the IPMI management system and the OS installation system.

Let me explain how these components work together to build a bare metal server instance. You can see the steps here. First, a user makes a request to create a new bare metal instance via the dashboard shown here. Once the end user tries to create the bare metal instance from the Veruda UI, the API request goes to the Nova layer, to nova-api. Some operations happen inside Nova, and one of the nova-compute services is picked by the nova-scheduler. That nova-compute then starts to handle the bare metal instance creation flow. During the creation flow, nova-compute first makes a request to PXE boot the bare metal server. After that, the nova-compute bare metal driver makes a request to create an OS installation task in the OS installation system. The OS installation then runs on the bare metal server automatically, and after about 10 minutes or so, the bare metal server launches and the nova-compute driver reports back to the Nova layer that instance creation is complete. That is a brief summary of the bare metal server creation mechanism using our newly developed driver.

This is the instance management view for the bare metal user. This page shows the instance list. As you can see in the top five rows, sorry, the characters are very small, the server type column shows the n3.small.metal flavor, and the .metal flavor indicates that these are bare metal servers. In this project there are five bare metal servers, currently in spawning status. The bottom three rows show virtual machines, for example with one CPU, one gigabyte of memory, and some amount of SSD. So from the end user's perspective, they can manage both bare metal servers and virtual machines in a single instance view, and they can create either a virtual machine or a bare metal server by clicking the create instance button and choosing any flavor assigned to their project.
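Here is a small sketch of what that unified API means in practice: requesting a bare metal server is the same create-server call as the earlier VM example, only with a .metal flavor. The flavor name follows the slide; the image and network names are placeholders, and the timeout is just an illustrative value.

```python
# Sketch: the same unified API call, but a ".metal" flavor maps to a physical
# server instead of a VM. Image, network, and cloud names are placeholders.
import openstack

conn = openstack.connect(cloud="my-private-cloud")

flavor = conn.compute.find_flavor("n3.small.metal")   # bare metal flavor
image = conn.compute.find_image("ubuntu-22.04")
network = conn.network.find_network("service-network")

server = conn.compute.create_server(
    name="redis-metal-01",
    flavor_id=flavor.id,
    image_id=image.id,
    networks=[{"uuid": network.id}],
)

# Behind this one call, the bare metal driver PXE boots the machine and runs
# the OS installation task, so the build takes on the order of 10-15 minutes.
server = conn.compute.wait_for_server(server, wait=1800)
print(server.name, server.status)
```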
Okay, from now on, I'd like to explain in more detail the features provided by the bare metal server management system. The first one is stock management, the second is HA group support, and the third is the deployment procedure for the bare metal driver.

For the stock management system, we use Nova's host aggregate mechanism, and we support two types of stock: public and private. The left side here shows the public stock, for all projects. Public means exactly that: any project in the Veruda system, project 1, project 2, project X, can consume this flavor of bare metal server as long as there is stock in the host aggregate. The right side shows private stock, which is dedicated to project 10. In this case, project 10 has two flavors, n3.small.metal and n3.large.metal, and each flavor's host aggregate contains multiple bare metal servers. Project 10 can consume its pre-assigned stock whenever it needs to create a bare metal server, and this pre-assigned stock cannot be used by any other project; it is dedicated stock for project 10. This slide shows the host aggregate show command output for project 10. We associate the hosts that are pre-assigned to the project, for example ZZ1, ZZ2, and ZZ3; they are the pre-assigned stock for the project with UUID 12345.

To request pre-assigned stock for their project, end users make a request from the Veruda UI. When a user requests private stock, a workflow is created automatically and arrives at the Veruda operator side. The Veruda operator then validates whether the user's request to pre-assign bare metal servers is suitable for their purpose. Once we approve the workflow, the stock is automatically assigned to the user's project, and they can start consuming the bare metal servers from the Veruda UI.
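As a minimal sketch of how this kind of per-project private stock can be modeled with Nova host aggregates: the aggregate, host, and project names below are placeholders, and the filter_tenant_id key belongs to Nova's stock AggregateMultiTenancyIsolation scheduler filter, which is one standard way to reserve an aggregate for a single project; Veruda's actual metadata and scheduling logic may differ.

```python
# Sketch: modeling a per-project "private stock" pool as a Nova host aggregate.
# Names and the project UUID are placeholders.
import openstack

conn = openstack.connect(cloud="my-private-cloud")

# One host aggregate per (project, flavor) stock pool.
aggregate = conn.compute.create_aggregate(name="project10-n3.small.metal")

# Register the pre-assigned physical nodes as members of the pool.
for host in ["zz1", "zz2", "zz3"]:
    conn.compute.add_host_to_aggregate(aggregate, host)

# Restrict scheduling on this aggregate to the owning project.
conn.compute.set_aggregate_metadata(aggregate, {"filter_tenant_id": "12345"})
```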
OK, next is HA group support. In order to make LINE applications highly available, application teams require location-based availability, and Veruda supports three levels of high availability: multiple regions, multiple availability zones, and a server-rack-level failure domain. For the first two, multiple regions and multiple availability zones, end users can choose a region or availability zone when they make a request through the stock workflow. The third one, the server-rack-level failure domain, is available from the Veruda dashboard when a user creates a new bare metal instance. The HA group supports several policies: hard, soft, and none. If the end user chooses the hard policy, the servers must be distributed across multiple failure domains. That means the physical servers must be spread across multiple server racks, because if one rack has a power unit failure or a switch failure, the whole rack goes down; to avoid that kind of issue, with the hard policy the bare metal servers must be distributed across multiple racks. The second one is the soft policy: in this case, we try to distribute the servers across multiple failure domains as much as possible, but even if that is not possible, the bare metal server creation does not fail.

This is the Veruda dashboard UI: when a user creates a new instance, they can choose the HA group policy, none, hard, or soft, from the UI. And this is the result when I chose the hard policy and created five bare metal servers. As you can see on the right side, the value starting with R shows the rack number; these five bare metal servers were distributed across multiple racks because I chose the hard policy. So end users can see where their servers are located, at the region or data center level and also at the rack level.

The final item is the deployment procedure for the bare metal driver. Sorry, this figure is a bit complicated, but we use GitHub and Argo CD to deploy the control plane side of the bare metal server management system. On the GitHub side, we store the physical hardware information of the bare metal servers, such as IP addresses, MAC addresses, and so on. When a bare metal operator registers a new server in Git, Argo CD automatically watches the repository and starts syncing the bare metal server information into the bare metal Kubernetes environment. After that, Argo CD starts a job running on Kubernetes that applies the Kubernetes manifests or Helm charts to the bare metal Kubernetes environment, and this job also registers the bare metal server information into the host aggregates, together with the flavor and so on. This is a brief summary of how we manage the current bare metal server system using GitHub and the bare metal Kubernetes service.

That's all for our presentation. Thank you for your attention. If you have a question, please go ahead.

Are you going to make the slides available? I'm not sure; let me ask the staff later. Maybe they'll upload the video to YouTube, so you could watch it there. Thank you.

Thank you for the presentation. I would like to ask two questions. First, how long does it take to deploy a bare metal instance? Yes, the whole bare metal deployment procedure contains the hardware boot sequence and also the OS installation phase, so in total it takes about 10 to 15 minutes to boot up the hardware and install the OS. Thank you. The second question is how often bare metal instances are deployed; is it the same frequency as VM instances or not? Not so much; we did not measure how often bare metal instances are deployed, but currently we support more than 36,000 bare metal servers, so end users create bare metal servers quite often, either through OpenStack or through our previous in-house bare metal server management system. Thank you. Please continue.

Thank you for the presentation. One question: why do users want to use bare metal? Does your bare metal have some specific hardware or something? I think the main reason end users use bare metal servers is that their requirements demand very low latency. Also, when users use virtual machines,
one hypervisor hosts multiple servers, and sometimes that causes a noisy neighbor situation. So for very strict server or application management, our application developers still want to use bare metal servers: it eliminates the noisy neighbor problem and can also reduce latency. Especially for Redis servers or database servers, they use bare metal servers for that purpose. Thank you. OK, thank you. It seems we are out of time. Thank you for joining the session.