OK. Ladies and gentlemen, it's an honor to have the opportunity to stand here and address such a distinguished audience today. Let me start with a few words about my background. My name is Han Chenlin, and I am the technical director at T2 Cloud. I lead the quality assurance team and have done a lot of DevOps and OpenStack-related work over the past few years. Before I joined T2 Cloud, I worked at Citrix for over seven years, where I devoted my efforts to Xen virtualization and cloud computing related product testing. That's all about me.

The subject of my presentation is the challenge of OpenStack performance optimization for 800 nodes in a single region. This is based on work we actually did. Last year, we worked with our partner, China Railway Corporation, to verify whether OpenStack can support 800 nodes and 100,000 virtual machines in a single region, and we made it happen. So today I want to share our experience, our ideas, and our solutions with you, and I'd like to have a full discussion.

I think most of you are not familiar with our company, so let me take one minute to introduce it briefly. T2 Cloud is a leading private cloud solution provider in China. We are also a startup, and our core team has been dedicated to OpenStack since 2011. We are a corporate sponsor of the OpenStack Foundation. We have built the largest-scale OpenStack practice in the transportation industry, with China Railway Corporation, as I mentioned before. And we have a variety of products, including T2 Cloud OS, Magic Stack, HCI, and our hosted private cloud. OK, that's all about our company.

Let's get back to the point and take a look at today's agenda. My presentation is divided into four sections. In the first section, I will talk about the background: the project goal, the architecture, and the tests we did. In the second section, where I will spend most of my time, I will introduce the issues we hit and the solutions we provided. In the third section, I will give you some configuration suggestions for deploying a large-scale OpenStack cluster. And in the last section, I will draw some conclusions.

OK, let's look at the background first and check the project goal. Most of you will probably ask: why would you put so many nodes in a single region? Why not separate them into different regions? Isn't that ridiculous? Before I answer these questions, I want to let you know which client we are working for: China Railway Corporation. Who is China Railway Corporation? It is the largest, and the only, railway operator in China. It has about two million employees, and its total assets are up to about 700 billion US dollars. It's magnificent, right? The railway mileage in China has already reached 120,000 kilometers, and the high-speed railway is also up to 19,000 kilometers. During the Spring Festival travel rush this year, China Railway handled about 400 million passenger trips in just 40 days, and it carried over 700 million tons of freight in the first quarter of this year. So this brings a tremendous challenge not only to the railway itself, but also to its e-commerce website and its IT facilities. And China Railway Corporation is going to start moving their systems onto OpenStack, so we needed to verify the stability, scalability, and performance ahead of time.
If we can put 800 nodes and 100,000 virtual machines in a single region, we have sufficient confidence to persuade them to move as many of their systems as possible onto OpenStack. That was our purpose in doing the verification.

OK, in this slide you can see our topology diagram, the hardware we used, and the tests we did. In order to make the test more convincing, all of the verification was based on real production hardware. You can see we have three types of storage media, including SAS, SSD, and PCIe SSD; the SSD and PCIe SSD work as the journal cache for Ceph. The testing environment consists of 600 compute nodes, 117 storage nodes, three controller nodes, five Ceph monitor nodes, and some other nodes working as bare metal, big data analysis, and monitoring nodes. In total that is about 800 nodes, and we created 100,000 virtual machines to verify the scalability of OpenStack, and we made it happen. You can see we used iperf, fio, and Rally to test not only API performance but also storage and network performance, and we had some dedicated automated tests for the database and RabbitMQ. We also mocked up OLAP and OLTP workloads running on the 100,000 virtual machines, 24 hours a day, seven days a week, to make sure our OpenStack had sufficient stability. OK, that's all for the test overview.

In this slide you can see the architecture. We have three controller nodes, we use Red Hat 7.2 as our operating system, and we use Liberty as our OpenStack version. MariaDB is our database, and we use RabbitMQ as our message queue. We use Keepalived and HAProxy to ensure the high availability of the controller nodes. And we use Linux bridge rather than Open vSwitch, because Linux bridge is more stable, and we use Ceph as our storage.

Let's move on to the main section, the issues and solutions. There are four things I'd like to talk about in this section: MySQL, Neutron, Keystone, and Ceph. Let's start with the first part, MySQL. In the course of our test we encountered MySQL performance issues, and we have some solutions to share, so I want to talk about them. As we know, a Galera cluster supports multiple primary nodes, so theoretically all of the DB requests can be balanced across different database nodes. This mechanism seems exceedingly reasonable, and the performance should be good. But it does not work for OpenStack. Let's do a diagnosis and find out the reason.

Suppose two transactions try to update the same row simultaneously. Thread A and thread B each execute SELECT ... FOR UPDATE and both successfully lock the same row at the same time, because the load balancer has routed their queries to different nodes and the row lock is only local to each node. Now assume the commit of thread A is accepted first and its binlog has already been synced to the other nodes successfully. Thread B will then get a deadlock error because of the data inconsistency. So we cannot let this happen. As the community suggests, we need to switch from multi-primary to single-primary mode to avoid the deadlock. As the diagram shows, we have to enable active/backup mode in HAProxy so that all of the DB requests are routed to the same node exclusively. So in order to avoid the deadlock, we actually abandoned the concurrency. Do we have another choice?
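To make the conflict I just described concrete, here is a minimal sketch in SQLAlchemy terms, assuming two Galera nodes in multi-primary mode reachable at hypothetical URLs; the table, column, and credential names are only for illustration, not our production code. Because the row lock taken by SELECT ... FOR UPDATE is local to each node, both sides acquire it, and the later commit is aborted at certification time and surfaces as a deadlock error.

```python
# Minimal sketch of the multi-primary write conflict; hosts, tables and
# credentials below are hypothetical.
from sqlalchemy import create_engine, text

# Thread A talks to node 1, thread B talks to node 2 (multi-primary mode).
engine_a = create_engine("mysql+pymysql://nova:pass@galera-node1/nova")
engine_b = create_engine("mysql+pymysql://nova:pass@galera-node2/nova")

conn_a = engine_a.connect()
conn_b = engine_b.connect()
txn_a = conn_a.begin()
txn_b = conn_b.begin()

# Both transactions lock the same row with SELECT ... FOR UPDATE.
row_lock = text("SELECT * FROM instances WHERE uuid = :u FOR UPDATE")
conn_a.execute(row_lock, {"u": "vm-0001"})
conn_b.execute(row_lock, {"u": "vm-0001"})   # also succeeds: the lock is local to node 2

conn_a.execute(text("UPDATE instances SET vm_state = 'active' WHERE uuid = :u"),
               {"u": "vm-0001"})
txn_a.commit()   # node 1 certifies the writeset and replicates it first

conn_b.execute(text("UPDATE instances SET vm_state = 'error' WHERE uuid = :u"),
               {"u": "vm-0001"})
txn_b.commit()   # aborted by Galera certification and reported as a deadlock error
```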
We don't want the deadlock, but we need the concurrency too. Certainly, we have a solution. Let's check the first chart, our query proportions. According to our statistics, read requests account for 62% of all Nova database queries. If we check Neutron, the number even reaches an incredible 80%. And if we check the total query proportion, the number is up to 73%. So we can conclude that most of our database requests are reads. So if we cannot optimize the writes, why not optimize the reads? If we can optimize the reads, we can improve the performance significantly.

OK. But how do we separate reads from writes? Our solution is to introduce a database proxy to spread the workload. We compared different products, including MySQL Router, MySQL Proxy, and MyCat, among others, and finally we chose MyCat. But today I will not discuss the pros and cons of the different products, because I have no time to explain them.

OK. Let's check how MyCat works. If the VIP is on controller one, all of the DB requests will go to controller one, and the MyCat instance on controller one will receive them. It routes the write requests to the local DB, but the read requests are routed to the other databases. And if MyCat or the local writable DB crashes, the VIP will fail over to a different node. Suppose the VIP moves to controller two: then the MyCat on controller two becomes the hot spot, all of the requests go to the MyCat on controller two, and the local database on controller two becomes the write node.

However, have we really split read and write for OpenStack? The answer is no. Let's check what we encountered. Before I elaborate on the upcoming issue, I need to let you know how MyCat distinguishes reads from writes. If the statement is a SELECT, MyCat treats it as a read request; everything else, including INSERT, UPDATE, and DELETE, is treated as a write. But for statements inside a transaction it is different: MyCat uses the session variable autocommit to decide what is a read and what is a write. If autocommit equals one, it means read; if autocommit equals zero, MyCat treats it as a write. But the problem is that we found all of the SQL statements in our test were wrapped into transactions with autocommit equal to zero. Why? The culprit is SQLAlchemy. SQLAlchemy constructs the session context with autocommit equal to zero, whether the request is a read or a write. So we needed to resolve this issue; if we could not, we fundamentally could not distinguish reads from writes. Our solution was to change the SQLAlchemy logic to pass read-only statements with autocommit to MySQL, and to alter SQLAlchemy's commit and rollback logic when passing read-only statements. Because SQLAlchemy does not expose any API for us to change this value, we had to alter the logic of SQLAlchemy itself. It seems like everything is OK now, so far so good. But the fact is, it still did not work.

OK, let's check the second issue: savepoints. I'm not sure all of you know what a savepoint is, so let me explain it briefly. A savepoint is used inside a transaction, a nested transaction. The purpose of a savepoint is to save transaction rollback cost: if the transaction fails, you can roll back to the specific savepoint rather than to the start of the transaction, so it saves cost.
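To make that concrete, here is a minimal sketch of a savepoint in SQLAlchemy terms, assuming a plain engine with a hypothetical connection URL and hypothetical table names; begin_nested() emits a SAVEPOINT under the hood, so only the work done after it is rolled back when the inner step fails.

```python
# Minimal sketch of how a savepoint limits rollback cost; connection URL and
# table names are only for illustration.
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://nova:pass@db-node1/nova")  # hypothetical URL

with engine.connect() as conn:
    with conn.begin():                       # outer transaction
        conn.execute(text("INSERT INTO instances (uuid) VALUES ('vm-0001')"))
        try:
            with conn.begin_nested():        # emits SAVEPOINT under the hood
                conn.execute(
                    text("INSERT INTO instance_faults (uuid) VALUES ('vm-0001')"))
                raise RuntimeError("this step failed")
        except RuntimeError:
            # Only the work done after the SAVEPOINT is rolled back; the first
            # INSERT survives, and the outer transaction can still commit.
            pass
```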
But the thing is, to avoid deadlocks and race conditions we had to work around it by updating the code logic to remove savepoints for now, because we have not found another way to make savepoints work: MyCat does not support savepoints. OK, that's all about what we did for MySQL.

And in this slide, you can see that we are going to investigate MySQL Group Replication. According to MySQL, Group Replication should have higher performance than MariaDB Galera, but we still need to do some tests, not only on performance but also on functionality. Actually, we have already found some limitations of Group Replication; for example, it requires a primary key on every table, but as we know, some tables in Neutron or Ironic actually have no primary key, right? On the other hand, MySQL Group Replication has its own features and advantages, like its Paxos-based protocol and its conflict detection mechanism to avoid deadlocks, and so on. That's what we need to do in the future.

OK, let's move on to our second part, Neutron. The first bottleneck we encountered is L2 population. As we know, L2 population is a mechanism driver of the ML2 plugin which optimizes the implementation of overlay networks by populating the forwarding tables of the virtual switches. L2 population decreases the broadcast traffic in the physical network fabric when using overlay networks, right? L2 population works in this way: if a port changes, for example when we put a VM on a host, the L2 agent will scan the changes on this host and report the port status to the Neutron server through the RabbitMQ queue. The Neutron server receives this message and notifies the L2 population mechanism driver to fan out FDB entries to all of the L2 agents, so that they can bring their local forwarding tables up to date. The mechanism seems extremely reasonable, but it has been proven that it is very easy to crash RabbitMQ in the course of a large-scale test. Let's check what happened.

Suppose there are M virtual machines across N compute nodes booting up simultaneously. So what happens? The total number of RPC messages will be M times N. And please don't forget, we have 100,000 virtual machines and over 600 compute nodes. It's an incredible disaster for RabbitMQ; RabbitMQ will crash if you put so many messages into it. We call it an RPC storm. That's the first issue we encountered. Now suppose another scenario: there are multiple active ports, say X of them, across Y L2 agents, and the agents are all restarted at the same time. This triggers extra requests to the DB to retrieve all of the FDB entries in the same network, and X times Y messages to RabbitMQ. It's also a disaster for both the DB and RabbitMQ.

So how do we resolve it? Our solution is to add a cache and to introduce ZeroMQ. If we can alleviate the pressure on the DB and on RabbitMQ, we can resolve this issue. We add a cache to relieve the pressure on the DB: we store the FDB entries in memory as a cache, and if there is any change on a port, we update the cache immediately. The L2 population mechanism driver then retrieves the FDB entries only from the cache, not from the DB, which moderates the pressure on the database. And if we can move the pressure from RabbitMQ to somewhere else, we can moderate the pressure on RabbitMQ too, so we introduce a high-performance message queue, ZeroMQ, to resolve this issue.
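Here is a toy sketch of the in-memory FDB cache idea I just mentioned, purely as an illustration and not the actual patch we carry; the class and method names are invented for this example.

```python
# Toy sketch of an in-memory FDB cache for the l2pop mechanism driver.
from collections import defaultdict
import threading

class FdbCache:
    """Keeps the FDB entries per network in memory so the l2pop driver
    does not have to query the database on every port change."""

    def __init__(self):
        self._lock = threading.Lock()
        # network_id -> {port_id: (agent_host, mac, ip)}
        self._entries = defaultdict(dict)

    def port_updated(self, network_id, port_id, host, mac, ip):
        # Called whenever a port changes; the cache is refreshed immediately.
        with self._lock:
            self._entries[network_id][port_id] = (host, mac, ip)

    def port_deleted(self, network_id, port_id):
        with self._lock:
            self._entries[network_id].pop(port_id, None)

    def get_fdb_entries(self, network_id):
        # The l2pop mechanism driver reads from here instead of hitting the DB.
        with self._lock:
            return dict(self._entries[network_id])
```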
As we diagnosed before, we found that 50% of the RabbitMQ messages were L2 population fanout messages, so we needed to move them somewhere else, to ZeroMQ. So we use ZeroMQ to carry the fanout messages. Actually, the community has its own ideas and solutions to alleviate the pressure on RabbitMQ. But before I elaborate on them, I need to point out one fact: when a port is updated, the L2 population mechanism driver fans out the FDB entries to all of the other L2 agents. But that is not necessary; it only needs to fan out the FDB entries to the relevant L2 agents, not all of them. The first and second community solutions address exactly this issue. The third one the community provides is BaGPipe, which uses EVPN and BGP to alleviate the pressure on RabbitMQ: RabbitMQ no longer has to carry all of the fanout messages, and the pressure moves to the BGP infrastructure. If you find these solutions interesting, you can refer to the link.

OK, the second issue and solution is the report state. In the course of our test we found that a great majority of the Linux bridge agents were marked down constantly, which led to VM creation failures. But what happened? Let me explain how the Linux bridge agent works. The Linux bridge agent reports its configuration info to the Neutron server periodically. If the Neutron server is working well, it updates this info in the database and marks the relevant Linux bridge agent as up. But if the Neutron server is busy, just as in a large-scale test, it will not update the config info in the database, and the Linux bridge agent is marked as down. And if the Linux bridge agent does not receive any response from the Neutron server, it assumes the Neutron server is too busy, so it increases its report interval to avoid putting more pressure on the Neutron server. The Linux bridge agent uses an exponential backoff algorithm to increase the interval: the default interval is 30 seconds, but the maximum interval can grow to 300 seconds. And even worse, the interval will not decrease back to the default until you restart the Linux bridge agent. It's ridiculous. So if the interval grows to 300 seconds, all of the virtual machines scheduled to that host fail constantly. We had to resolve this issue; otherwise we could not provision 100,000 virtual machines in one day. Our first solution is to add a "use local" mode that permits the Linux bridge agent to update MySQL directly and immediately, so it does not put pressure on the Neutron server. And the second solution is to update the Neutron logic to remove the exponential backoff algorithm and use a fixed interval instead.

Let's move on to our third section, Keystone. In the course of our test we found that Keystone hung and could not accept any requests; it was very easy to make Keystone crash. After we analyzed the test report and the code, we found that the massive number of token requests led to tremendous access to the DB, but the database could not respond in time, so Keystone hung and could not accept any requests. The solution for this issue is to add Memcached, both to Keystone and to keystonemiddleware. All of the tokens are stored in Memcached rather than in the DB, and keystonemiddleware verifies the tokens using Memcached rather than the database too.
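As a minimal sketch of that idea, the token validation path can check Memcached first and only fall back to the slow backend on a miss. This assumes the python-memcached client, hypothetical controller host names, and a hypothetical validate_token_in_backend() helper standing in for the slow path; it is an illustration, not the keystonemiddleware code.

```python
# Minimal sketch: token validation prefers Memcached over the database.
import memcache

mc = memcache.Client(["controller1:11211", "controller2:11211"])  # hypothetical hosts
TOKEN_TTL = 3600  # seconds; keep it in line with the Keystone token lifetime

def validate_token(token_id):
    """Return the token data, hitting Memcached before the slow backend."""
    cached = mc.get("token/" + token_id)
    if cached is not None:
        return cached                              # fast path: no database access
    data = validate_token_in_backend(token_id)     # hypothetical slow path (DB lookup)
    mc.set("token/" + token_id, data, time=TOKEN_TTL)
    return data
```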
If we check the illustration on the right-hand side, you can see that Nginx actually handles 9,000 concurrent requests per second, which is higher than the default WSGI server. So we also introduced Nginx and uWSGI to resolve this issue and moderate the pressure on Keystone. And we also made some configuration optimizations, like increasing the number of public workers and admin workers, plus some configuration options in Nginx. OK, I think time is against us, so I need to speed up.

OK, the second issue with Keystone is actually not a performance issue but a high availability issue. When the VIP moved to the other controller node, we found that spawning VMs failed. But why? As we know, a high availability mechanism is meant to resolve two types of issue: the first is data loss, and the second is system downtime. But Memcached has no concept of a cluster, so it only resolves the system downtime; it loses the data if the VIP moves from one controller to the other. This leads to an issue where Nova invokes the neutronclient v2.0 to verify the token, but on controller two the token does not exist, so the request fails. Our solution is to add logic to the neutronclient v2_0 client: we invoke a reset to clear the token held in the session, so when the next request comes in, the token is re-applied for, and the issue is resolved.

OK, in the last part of the issues and solutions, I want to talk about Ceph. The problem is that in the course of our test we found that most of the OSDs and monitors were marked down constantly. We found the reason and figured it out: we were using the default messenger type of Ceph, which is simple, and we found that each OSD daemon created about 1,000 threads in the test. The system resources were exhausted, especially the PIDs. So we needed to resolve this issue: we enabled the async messenger instead of the simple one, and we increased the maximum PID numbers in the files you see on the slide. The second issue we found was SYN flooding: we got SYN flooding errors in the log, and we found we had to increase the max SYN backlog to over 8,000. We also increased the number of Ceph monitors to avoid the monitors crashing, because once the number of virtual machines reached 20,000, the monitors went down very easily.

And this slide is our third section, where I provide some configuration suggestions. On the left of the equals sign you can see the config option, and on the right you can see the default value. But you need to adjust the values to your real requirements in your own OpenStack; in different environments you need to set them to different values.

In conclusion, I need to tell you that we had to put a lot of effort into testing and improving the performance, stability, and scalability before starting large-scale enterprise deployments. The second conclusion I will give you is that whatever fits best is best, just like using the "use local" mode or the fixed interval to resolve the report state issue. The third one is that I would advise against really deploying 100,000 virtual machines in a single region, because it will bring exceedingly great trouble to your operations; it would be an operational disaster for you. The last one: the more tests you do, the more stable your cloud will be. OK. Thank you, everyone. That's all for my presentation. If you have any questions, you can come to our booth; our booth is A25 in the marketplace. OK. Thank you, everyone.