Cinder under high concurrency pressure in a large-scale environment. Okay, my name is Wu Yutong. I come from AWcloud Corporation. Our company is based in Beijing, China, and we provide OpenStack-powered solutions for enterprises.

Okay, this is the content I will present this afternoon. First, I will describe the problem we met. Then I will show the Cinder deployment architecture. Then I will talk about some factors that affect Cinder performance, and I will give a solution for each factor. At last, I will compare the results before and after tuning.

First, the problem we met when we used Cinder: Cinder could not create volumes as quickly as we needed. To illustrate the problem, let me introduce the project first. We deployed a 200-node OpenStack environment, and we used Rally to evaluate the performance of this cluster. What we found is that under concurrency pressure up to 200, some instances would fail to boot, because in our case instances are booted from volumes. We analyzed this phenomenon and found that the key factor leading to the failures was that Cinder could not create volumes quickly enough.

Okay, this picture shows the problem. Nova compute sends a request to Cinder to create a volume and waits for the volume until it is ready. However, Cinder cannot create the volume quickly enough, so Nova compute times out and the instance fails to boot.

Okay, this is the deployment architecture of Cinder. The client sends a request to HAProxy, and HAProxy load-balances the request to the Cinder API, which passes it on to the Cinder scheduler and then to the Cinder volume service. The arrows show the request flow path. There are many components in this flow path, and each of them can be a performance factor: HAProxy, the Cinder API workers, the number of Cinder volume workers, the database, the storage driver, and others. I will talk about each one in detail.

First, let us look into HAProxy. When we used Rally to benchmark our cluster, we found some 504 errors in the HAProxy log. At first we could not determine why HAProxy reported 504 errors; we did not know whether they came from HAProxy itself or from the backend services. So we removed HAProxy and ran another test. Okay, this picture shows two different test cases. In the upper one, some 504 errors happened; in the second test case, all the Cinder volumes were created successfully, just slowly. So we could determine that the problem happened in HAProxy, and we found that our HAProxy version was too old. After we upgraded HAProxy from 1.24 to 1.11, the 504 errors disappeared, and since then HAProxy has worked well.

Let's get into the Cinder API. With the Cinder API there was an obvious phenomenon: under a high workload, the Cinder API takes longer and longer to process each request. There is at least one reason that explains this: almost all of the Cinder API's work is accessing the database, but the database connection driver is not monkey-patched by eventlet, so when there are many requests they are effectively processed by a single Cinder API process serially. So the Cinder API is a performance bottleneck. The solution is very simple: we can run several Cinder API workers. In our environment, we set osapi_volume_workers to 10 on each node. Since then, the Cinder API has worked well for us.
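To make the worker setting above concrete, here is a minimal cinder.conf sketch. The value 10 is the one mentioned in the talk; everything else is standard layout, so treat the fragment as illustrative rather than the speaker's exact configuration.

```ini
# /etc/cinder/cinder.conf (illustrative sketch)
[DEFAULT]
# Run multiple cinder-api worker processes so that requests are not
# serialized behind a single, unpatched database connection.
osapi_volume_workers = 10
```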
If the request is to create a volume, the Cinder API sends it on to the Cinder scheduler; for other operations, the Cinder API sends the request directly to the Cinder volume service. So the Cinder scheduler has a light workload, and it is not a performance issue.

Next, let's look at the Cinder volume service. Compared with the Cinder API, Cinder volume has a more serious performance issue. Cinder volume is a heavy-workload service, especially in our environment, where we build instances based on volumes: Cinder volume has to create the volume, initialize the connection, and finally attach the volume to the instance. When the concurrency rises to a certain level, Cinder volume no longer works well; it cannot consume messages from the message queue immediately. Maybe you know that under normal conditions Cinder volume can create a volume in several seconds, but under high concurrency it may take several minutes. So Cinder volume is a performance bottleneck. The solution is very similar to the Cinder API one: we run multiple Cinder volume workers. However, this is not supported by the community; it is our private patch.

This is not the end of the story, though. If you run multiple Cinder volume workers on one node, some unacceptable things happen: there can be race conditions. At least one example is volume extend. If two Cinder volume workers accept requests on the same volume, a race condition can occur and leave the database in an inconsistent state. So we had to fix this problem, and the solution is also not complicated: we add the locked_volume_operation decorator to extend_volume in the volume manager. This is also our private patch, not supported by the community.

Next, we will talk about the storage driver. As you know, Cinder volume uses a storage driver to communicate with the backend storage. In our environment, we use Ceph as the backend, and we found that librbd and librados are not monkey-patched by eventlet. So if we run Cinder volume under high concurrency, some Cinder volume workers may appear to hang. The phenomenon is that you can use ps to see that the Cinder volume process is still alive, but it does not accept or process requests. Long-running blocking calls inside Cinder volume block the eventlet event loop, so we had to fix this. Fortunately, the community has a solution for this problem; you can backport the patch to your Cinder version.

Next, we will talk about something related to Glance. Because we boot instances from volumes, Cinder volume needs to create bootable volumes. In the normal process, Cinder volume downloads the image from Glance, converts it to raw if necessary, and copies the image data into the volume. Downloading the image, converting it to raw, and copying it into the volume are all time-consuming, so if we want to boot from volume quickly enough, we must change this. In our environment we use Ceph as the backend, and Glance also supports Ceph as a backend, so we configured Glance to use Ceph as well. Each image then maps to an RBD image with a snapshot in the Ceph storage, and creating a volume from that image only creates a copy-on-write clone, so it is very, very quick. But there are some conditions: first, you must configure Glance to use the RBD store, and the images you upload to Glance must be in raw disk format.
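A hedged sketch of the kind of configuration this implies is shown below. The option names are the standard Glance RBD store and Cinder RBD driver options; the pool names and Ceph users are assumptions, not values given in the talk.

```ini
# glance-api.conf (sketch): keep images as RBD objects in Ceph and
# expose their direct location so Cinder can clone them.
[DEFAULT]
show_image_direct_url = True

[glance_store]
stores = rbd
default_store = rbd
rbd_store_pool = images              # assumed pool name
rbd_store_ceph_conf = /etc/ceph/ceph.conf
rbd_store_user = glance              # assumed Ceph user

# cinder.conf (sketch): with the RBD driver, a raw Glance image becomes
# a copy-on-write clone instead of a download/convert/copy.
[DEFAULT]
glance_api_version = 2
volume_driver = cinder.volume.drivers.rbd.RBDDriver
rbd_pool = volumes                   # assumed pool name
rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_user = cinder                    # assumed Ceph user
```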
Next, I will talk about the database. We found a phenomenon: when you deploy a new cluster, creating or deleting volumes and snapshots is very quick, but after the cluster has been running for some time, create and delete operations on volumes and snapshots become very slow. So something was happening in the database. We used a monitoring tool to analyze the database performance, and we found that the database time is mainly spent in the reservations table. The reservations table holds resource reservations for quotas. In a standard deployment, the reservations table is the fastest-growing table, because over the lifecycle of each volume, and also each snapshot, at least four entries are added to it: when you create a volume, two entries are added, and when you delete a volume, two more entries are added. In our deployment we found that once the number of entries in the reservations table reaches about 400,000, database performance declines sharply. And there is another aspect to the problem: stale entries are usually not cleaned up. So the solution is to add a combined index to the reservations table. We have a patch for this, but it has not been merged in the community.

If you run a large OpenStack environment, there are some other configuration items that need to be considered. You must increase the RPC response timeout, and if you use RabbitMQ for RPC, the related RabbitMQ timeouts also need to be increased. And if your cluster has more than 1,000 volumes, you must also increase osapi_max_limit. This value limits how many volumes are returned for a single list request; the default value is 1,000.
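For reference, both of these settings live in cinder.conf. A minimal sketch follows; the values are examples only, since the talk gives no exact numbers beyond the 1,000 default for osapi_max_limit.

```ini
# /etc/cinder/cinder.conf (illustrative sketch for a large deployment)
[DEFAULT]
# Give slow operations more time before RPC calls time out
# (the oslo.messaging default is 60 seconds; 180 is an example value).
rpc_response_timeout = 180
# Let list requests return more than the default 1,000 items when the
# cluster holds several thousand volumes (5000 is an example value).
osapi_max_limit = 5000
```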
At last, I will show the results before and after tuning. We used Rally to benchmark our cluster, and I will show you two test cases. The first test case uses Nova servers; the scenario is boot server from volume. The left picture shows the results before tuning: when the concurrency goes up to 200, some instances fail to boot. You can see the red line in the left picture, which means some instances failed to boot. The right picture shows the results after tuning: when the concurrency goes up to 200, all instances boot successfully, as you can see in the right picture.

The other test case creates normal volumes, not bootable volumes. As in the previous slide, the left picture shows the results before tuning: when the concurrency goes up to 1,000, some volumes are not created successfully, and you can see the red line in the left picture. But after our tuning, even when the concurrency goes up to 2,000, all the requests are successful. That's it. That's all.

This is the deployment architecture. We determine the number of API workers from the HAProxy log. HAProxy records a lot of information about each request; in its log lines there are fields for the response time of the backend service. So we can determine the number of API workers from the response time of each request: use Rally to benchmark the cluster, pay attention to the response time of each request, and increase the number of API workers until the response time meets your requirements.

Can you repeat it?

When you set up your Cinder API, pretty much adding to his question, you have n Cinder API workers. How did you come up with the number? Did you have to do multiple tests and just keep adding Cinder API workers and watching HAProxy? Or did you say, if HAProxy's load or threshold is below some number of milliseconds, then we know we're at the right number? See, what I'm saying is we don't necessarily know when somebody is going to spin up 200 VMs or 500 VMs. So the issue is that we need to be able to monitor HAProxy's threshold in order to spin up more Cinder API workers to meet that demand.

We configure the number of Cinder API workers, restart the Cinder API services, and then test again, using Rally to benchmark the Cinder API.

You just had preset tests?

Yeah, preset.

I would like to ask, during your tests, did you try increasing the number of database connections? I mean, usually the default configuration only allows up to 15 concurrent database connections per service, so the API can only make 15 requests to the database concurrently. Did you try increasing that number?

No, we did not do that.

Yeah, you should probably try, because if you have high concurrent requests, for example an attach that is taking two minutes, you can see it drop to 20 seconds just by increasing that number from, I don't know, it's five by default plus 10 max overflow. If you increase the max overflow to something like 90 or 100, you will see everything go a lot faster, mostly on the Cinder API and the Cinder volume nodes. The bottleneck is not the database but the default number of connections Oslo was setting: oslo.db was not setting any defaults for the number of database connections a service can make, so it fell back to the SQLAlchemy library defaults, which were five plus 10 max overflow. So if you had 1,000 concurrent API requests, you would only have 15 database connections from the API to the database, so all those requests would start queuing and be serialized from 100 or 1,000 down to 15 at the database. That's when you get the bottleneck. So if you reduce, like you did, the number of concurrent workers: tests show that it's usually best to have approximately the same number of threads in the Cinder API as the number of database connections. There is a good mail thread by Mike Bayer about the tests he did in Nova, which I also reproduced in Cinder, so that's another way to fix many of the problems.

Are the patches that you referenced available in a public repository anywhere?

Which one?

The patches you mentioned that are not supported by the community, are they still available publicly if one of us wanted to use them?

Yes, they have been submitted to the community but not merged.

Okay. Due to the limited time, last question.

I just wanted to mention, I don't know if it's the next one or the previous slide, when you talked about the API races, okay, the race conditions: in Mitaka we fixed a bunch of those API races. Instead of using locks, we are doing compare-and-swap in the database, and for the remaining API races there are patches upstream that are pending review, for example for the extend, so they should be fixed in Newton, okay?

Yeah. Okay, that's all, thank you.
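To make the database-connection suggestion from the discussion concrete, here is a hedged cinder.conf sketch of the oslo.db pool options being referred to. The numbers echo the ones mentioned in the exchange ("90 or 100" max overflow on top of the SQLAlchemy default pool of 5) and are illustrations, not the speaker's settings.

```ini
# /etc/cinder/cinder.conf (sketch of the pool sizing discussed above)
[database]
# SQLAlchemy defaults historically gave 5 pooled connections plus
# 10 overflow, i.e. about 15 concurrent connections per service.
max_pool_size = 5
max_overflow = 100   # example value from the discussion
```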