Hello, everybody. Welcome to the session. Today we are here to explain Oslo Metrics and how we are using Oslo Metrics in our OpenStack infrastructure. First, about us. My name is Rida Banji. I have been working with OpenStack for some time now, and I am currently working as an engineer at LINE Corporation. With me is my teammate and colleague Motomu Utsumi-san, who also works with me as an engineer on the LINE private cloud. The agenda for today's session: first we give a brief introduction to LINE's infrastructure, the various problems we have faced, and how we thought about solving them; then how Oslo Metrics came into the picture, what Oslo Metrics actually is, and how we are using Oslo Metrics in our OpenStack cloud. A brief about us: LINE Corporation encompasses several services, including LINE messaging, LINE Pay, games, LINE Music, LINE TV, LINE Manga and several others. These are different services in different domains, for example fintech, commerce and artificial intelligence, and we have ad- and content-based businesses as well. LINE Corporation operates in different parts of Asia, including Japan, Taiwan, Indonesia and Korea. We have roughly 166 million monthly active users, with almost 84 million here in Japan. The LINE infrastructure is divided into different parts. The two main components are the development environment, which is for in-house development, and the production environment, which is used for the production services themselves. We also have a third part, a kind of semi-private, semi-public cloud, which is shared with other companies. We have more than 2,000 hypervisors, which host almost 50,000 virtual machines. We also have around 20,000 bare metal servers and a fair number of Kubernetes clusters as well. Almost 80% of the development environment within LINE is running on OpenStack. Now a brief introduction to the high-level architecture. We have divided LINE's architecture into three parts. The first is the infrastructure service layer, which is the base layer; it consists of OpenStack components like Neutron, Nova, Glance and Ceph, and it also includes load balancers and other bare metal components. On top of that we have a platform-as-a-service layer, which includes Rancher, Kafka, Trove, Elasticsearch and MySQL. And at the top we have another layer, a kind of in-house layer, the function service layer, which runs on Kubernetes and enables serverless computing. With this, we would like to bring your attention to the problems we faced. When we started scaling up and modifying our infrastructure, we ran into different problems, and these problems motivated us to verify what the underlying issues in our infrastructure were. These problems included RabbitMQ messages being lost or delayed in delivery; certain RPC servers hitting exceptions and stopping work; the time taken by a server to process an RPC sometimes being much longer than the actual RPC timeout; and RabbitMQ nodes failing due to high memory usage, high CPU usage, memory leaks, or too many open files, which is a problem we have faced recently. We also faced RabbitMQ queue drain problems, unsynchronized queues, and so on.
These problems were highlighted in a previous Summit session given by my colleagues about one or two years ago, and you can refer to that session later to learn about the problems we had at that time. When we hit these problems, we started trying to understand and improve the situation, and we divided the work into different parts. First of all, we started to verify the scalability issues: whether we scale horizontally or vertically, we need to verify the bottlenecks and the limits that cause problems as we scale. For example, there may be certain parameters we need to reconfigure, such as RPC timeouts that may need to grow as the scale increases, and any modification of the architecture was also included. So we needed to verify the different scalability factors that caused problems like the ones I explained earlier. Similarly, we also focused on the reliability part: how can we improve our OpenStack's reliability? What are the various ways we can monitor and track latency issues? How can we track the number of connections in flight, the time taken by different connections, and the average load caused by these connections? We also started to identify the various places where we could insert data points or hooks with which we could troubleshoot and investigate things much more easily and quickly. And this led us to Oslo Metrics. So first of all, what is Oslo Metrics? Oslo Metrics is an open source library; there is already a spec for it in OpenStack upstream, and we developed the project in-house. The Oslo Metrics library collects the metrics exposed by the internal Oslo libraries themselves. What this means is that we have different Oslo libraries, like Oslo Messaging and Oslo DB; we place Oslo Metrics on top of these libraries, gather the metrics from them, and expose them to the administrators and operators, who are the people actually monitoring the usage of these Oslo libraries. So there is role-based control, obviously. Oslo Metrics can give us different kinds of information, for example the number of RPC calls, the RPC exceptions, or the amount of time taken for an RPC call. Similarly, how much time does it take to get a database query back, or how many database queries are issued, for example during a VM creation or a network creation? We can also use this information to see the difference in RPC behavior, that is, the number of RPC calls made or the amount of time taken by each RPC call as we scale up or down, or as we increase the number of API requests or the client-server interactions between two nodes. Oslo Metrics has taken some inspiration from the RPC monitor implementation in Oslo Messaging; however, the RPC monitor is limited to the Oslo Messaging part, while Oslo Metrics itself can be used with different Oslo libraries. So here we present a brief architecture. As you can see, the simple flow is that we have the Oslo libraries on top.
Below that we have Oslo Metrics, and after that we can have any data collection entity, for example Prometheus. We have shown two basic deployments here: one is an RPM deployment and one is a container deployment. The concept behind both is the same. We have a service, which can be a Nova conductor, a Nova API or a Neutron server on the controller side, or a Nova compute or a Neutron agent on the compute side, whether they are running on a Kubernetes cluster or on normal hypervisors. Wherever these services are running, they use Oslo Messaging for the RPC mechanism. Whenever Oslo Messaging gets a message from a particular service, say Nova conductor, Nova compute or Neutron, it sends it out on a particular exchange or with a particular topic. When it sends that message out, we also send a copy of the corresponding metric information to a socket file. This socket file is monitored by Oslo Metrics, and whenever Oslo Metrics gets data from the socket file, the information is sent out to the data collection point. The concept is basically the same for both the container deployment and the RPM deployment. And here is a simplified version: we have the RPC client and server on top, then the transport layer and the AMQP driver, which together form the Oslo Messaging layer, and then the RabbitMQ queues where all the messages go. When these messages go in, a copy of the information is also sent to the Oslo Metrics layer. Now let me tell you a bit more about the architecture. Since Oslo Metrics collects this data on a particular node and sends it out to a data collection point, the information may contain confidential or critical details, and we don't want to expose that. So it is ideal to deploy the Oslo Metrics data collection on an isolated internal network, which is a fairly obvious point. Oslo Metrics uses a Unix domain socket to communicate between the different services and the Oslo Metrics service. Say, for example, we have different services like Nova API and Neutron server on a particular controller. For every service that sends information through Oslo Messaging, whether Nova API, Neutron server, Nova conductor or scheduler, Oslo Messaging sends a copy of that information in a particular format to the socket file. There is only one socket file for the whole controller node: all the services send their data to that one socket file, and Oslo Metrics monitors it, gets the information, processes it, and sends it out to the data collection entity. The data can be exposed from Oslo Metrics in different formats. What we have done is use the Prometheus exposition format; however, you can modify it to a JSON format or any other format and send it out to your own in-house or any third-party data collection entity. We have also shared the message format which Oslo Metrics expects from the different Oslo libraries. It is a JSON format, as explained here, which consists of a module, a name, an action with a value, and labels.
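To make the flow concrete, here is a minimal sketch of the idea, not the actual oslo.metrics code: a service-side helper writes one JSON metric message to the shared Unix domain datagram socket, and a collector-side loop reads messages from that socket. The socket path, the function names (send_metric, run_collector) and the exact field layout (module, name, action, value, labels) follow the description in this talk and are illustrative assumptions, not the library's confirmed wire format.

```python
import json
import os
import socket

# Illustrative socket path; the real path is deployment-specific.
SOCKET_PATH = "/var/tmp/metrics_collector.sock"


def send_metric(module, name, action, value, labels):
    """Service side: send one metric message to the shared Unix datagram socket."""
    message = {
        "module": module,        # e.g. "oslo_messaging": the library reporting the metric
        "name": name,            # e.g. "rpc_client_invocation": what is being counted or timed
        "action": {              # how to apply the data point, per the format described in the talk
            "action": action,    # e.g. "add": add/subtract relative to the previous data point
            "value": value,      # e.g. 1, or a duration in seconds
        },
        "labels": labels,        # e.g. {"exchange": "nova", "method": "select_destination"}
    }
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(json.dumps(message).encode("utf-8"), SOCKET_PATH)


def run_collector():
    """Collector side: read metric messages from the socket and hand them to an exporter."""
    if os.path.exists(SOCKET_PATH):
        os.unlink(SOCKET_PATH)
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.bind(SOCKET_PATH)
        while True:
            data, _ = sock.recvfrom(65536)
            metric = json.loads(data.decode("utf-8"))
            # A real collector would update Prometheus counters/histograms here
            # and expose them over HTTP in the exposition format.
            print(metric)


if __name__ == "__main__":
    # Run the collector (blocks). In another process, a service would call, e.g.:
    #   send_metric("oslo_messaging", "rpc_client_invocation", "add", 1,
    #               {"exchange": "nova", "method": "select_destination"})
    run_collector()
```

The point of the sketch is simply that all services on a node write to one socket file and a single local daemon aggregates and re-exposes the data, which is why the whole path can stay on an isolated internal network.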
So there are different attributes, or properties, in each of these messages. The module is the module sending out the data, for example Nova or Neutron. The name is the kind of data being collected, for example RPC timeouts or RPC exceptions. The labels describe exactly what we are measuring: for example, if we are counting the RPC calls for a Neutron device update or a Neutron subnet list, the details of that call are sent out as labels. And then we have the action property, which consists of a particular value and a particular action describing how the data point should be applied, for example adding or subtracting an amount from the previous data point. Now my colleague Motomu Utsumi-san will present how we are using Oslo Metrics in LINE's OpenStack environment.

I will demonstrate how we are using Oslo Metrics in our private cloud. Currently, we are mainly using Oslo Metrics for three purposes: metrics visualization, troubleshooting, and metrics trend monitoring. I will explain these usages in detail starting from the next slide. I will first show you metrics visualization. We are visualizing metrics using Prometheus and Grafana, so let me show you the actual Grafana dashboard we are using. This is our Grafana dashboard for Oslo Metrics. In this section we show the RPC client-side metrics: invocation count, exception count, average processing time and 95th percentile processing time. I am showing the dashboard with a time range during which we had an outage, so you can see many exceptions. This section shows the RPC server-side metrics; same as the client side, we have invocation count, exception count, average processing time and 95th percentile processing time. Next is the messaging queue driver layer metrics. We display the send count, the reply-waiting count, and also some error and exception metrics we should not see in a healthy state. And last is the RabbitMQ driver implementation layer metrics; most of them are metrics for errors and exceptions. This is our current dashboard for Oslo Metrics. Now that I have shown the dashboard, let me show you how the metrics change behind OpenStack operations. These figures display the metrics change when we build a single instance. The left figure illustrates the RPC invocation count for each method; you can see the increase of some methods like update_instance_info and update_device_up. And the right figure illustrates the processing time for each method from the server side. The Y axis is log scale, so the differences look small, but you can see some difference between the methods. This metrics change itself is obvious and not so interesting, but the next one is more useful. This figure shows the select_destination processing time from the server side when we schedule instances. I tested it with 10, 100 and 500 instances. You can see select_destination takes 20 seconds for 10 instances, 35 seconds for 100 instances and 140 seconds for 500 instances. We set the RPC timeout to 120 seconds, so scheduling 500 instances exceeded the timeout threshold. These are example metrics changes behind operations. Next, I would like to show you a troubleshooting example using Oslo Metrics. One day a user reported that one instance had duplicated volume attachment entries, so I started investigating the issue with Oslo Metrics.
I first checked whether we had errors or exceptions related to volume operations, so I looked at the dashboard for RPC exceptions. I found a spike we did not normally have, and the error was a messaging timeout for reserve_block_device_name, which happened in Nova. This was suspicious; this error could be the cause of the issue. I also wanted to see the average processing time of reserve_block_device_name and how it exceeded the timeout threshold. So next I checked the dashboard for reserve_block_device_name processing time, and I could see the processing time had increased gradually and exceeded the timeout threshold. After I gathered this information, I dove into the code and investigated why the processing time increased. By using Oslo Metrics, I could easily identify the suspicious code. When we have an issue like this, we also want to know whether we had the same issue before, and whether we are likely to face it again soon. We can easily learn this from the dashboard, because we can see the average processing time in the past. The last major usage is metrics trend monitoring. The purpose of this monitoring is to prevent issues in advance. I think RPC processing time is a good example of a metric to use trend monitoring for. This figure is an example time series of RPC processing time. Looking at it, we can easily predict that this RPC could exceed the timeout threshold next quarter. This is a very straightforward example, but I think monitoring the time series change of metrics is useful in other cases as well. We deployed Oslo Metrics at the start of last quarter, so we don't have much time series data yet, but we will keep tracking the metrics trend and I hope we can share interesting results in the future. I have explained what we have done; let me also describe the projects now in progress: integration with other Oslo libraries, and bottleneck analysis with hypervisor emulation. As for the Oslo libraries integration, we have not yet created a detailed list of the metrics we want to collect, so this is a very rough draft, but we are planning to integrate Oslo Metrics with Oslo DB, Oslo Concurrency and Oslo Service. For Oslo DB, we want to see the query time and the number of queries. For Oslo Concurrency, we want to see the time to acquire a lock and the lock hold duration. Actually, in the troubleshooting example I explained on the previous slide, I showed the increase of RPC processing time, and the reason for that increase was an increase in the time to acquire a lock. So if I had had this metric at that time, I could have finished the troubleshooting more easily. And last, for Oslo Service, we want to see the periodic job processing time. Next, I will explain the hypervisor emulation project. Let me first explain the background a little bit. The right figure illustrates our cloud's growth in terms of the number of hypervisors. As you can see, the number of hypervisors has increased by around 500 per year. Our company is growing, and we are also actively migrating existing services to the OpenStack cloud, so we will keep adding even more hypervisors. We have faced some scalability and performance issues before, but until now we have often fixed an issue only after we hit it. For example, we observed that select_destination with 100 instances timed out, and we improved some logic to make it faster. One of the big factors that determines the select_destination processing time is the number of hypervisors. However, to provide a reliable private cloud, we want to understand the scale limitations and solve the issues before we hit them in our production cloud.
So the goal of this project is to be able to answer questions like: can our cluster handle 10,000 hypervisors? If we could prepare a test bed with 10,000 physical hypervisors, that would be great, but of course we cannot prepare that just for benchmarking. So we want to have emulated hypervisors. Using a small virtual machine per hypervisor is also one of the choices, but if we consider simulating 10,000 hypervisors, virtual machines might still be too expensive. We are still investigating, but for now this is the design draft. We will run containers which emulate hypervisors on Kubernetes, with a fake Neutron agent and a fake Nova compute. They work like the usual Neutron agent and Nova compute except for the actual provisioning part, so we can simulate the load of the target number of hypervisors. By using this emulator, we want to understand the scale limitations, and if we find an issue, we want to tune parameters, modify code, or modify the architecture to improve scalability. In this improvement process, we think Oslo Metrics will be useful. That is pretty much it for the hypervisor emulation project. As I mentioned, it is still at an early stage, so I hope we can share interesting results in the future. We are also working on upstreaming; here are some links to our activities. And last, this presentation explains the background of the Oslo Metrics project, so please check it if you are interested. This is all for our presentation. Thank you for your attention. If you have any questions, we would be happy to answer them now.