Hello, everyone. My name is Jin Haigong, and I'm from Inspur. The topic I am sharing is performance optimization of the Neutron server and agents for large-scale scenarios. In September of last year, Inspur and Intel conducted a 500-node large-scale test of the OpenStack Rocky version to discover and solve the problems of using OpenStack in large-scale deployments. The test platform used more than 500 servers based on Intel Cascade Lake processors, running Inspur's InCloud OpenStack system. Here we will share the solutions to the agent problems we encountered during the large-scale test. In large-scale testing, an important test scenario is the concurrent creation of a large number of virtual machines, which makes Neutron create ports in large quantities, scaling with the size of the cluster. We were able to keep the code unchanged and only tune the configuration parameters of the Neutron server to meet the demand of concurrent port creation. Although port creation no longer failed, many virtual machines still failed to be created. The reason for the failure is that creating a virtual machine waits for the network card to come up, and the time for the network card to come up exceeds the default waiting time of Nova. In response to this situation, we tried two solutions: one is to increase the waiting time of nova-compute, and the other is to make nova-compute no longer wait for the network card to come up. Both of these solutions allow mass creation of virtual machines to succeed, but each has its own problem. With the first, it takes a long time to create a virtual machine. With the second, although a virtual machine can be created quickly, it cannot obtain an IP address through DHCP, or, because the virtual machine has no flow tables, its network is actually blocked.
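For reference, the two workarounds mentioned above correspond to two settings in nova.conf on the compute node; the values shown here are illustrative.

```ini
[DEFAULT]
# Option 1: wait longer for Neutron's network-vif-plugged event
# (the default is 300 seconds)
vif_plugging_timeout = 600

# Option 2: boot the VM even if the event never arrives
# (the VM may come up with no DHCP lease or flow tables yet)
vif_plugging_is_fatal = False
```

Option 1 trades creation latency for correctness; option 2 trades correctness for speed, which is exactly the dilemma described above.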
You need to wait for the network card to be up before the network is really connected and the virtual machine is usable. Since neither of these two solutions fundamentally solved the problem, we analyzed the port's transition to UP in depth and found that when a port is created on a subnet with DHCP enabled, the Neutron server adds two provisioning blocks to the port, which are two entries written into the database: one for DHCP, and the other for the Open vSwitch (OVS) agent. Only when both of these blocks are removed will the Neutron server set the port status to ACTIVE and then notify nova-compute that the network port is active. One of the two blocks requires the DHCP agent to report that the port's DHCP is ready, and the other requires the OVS agent to report that the port is up. Our guess was that the removal of blocks was slow. Therefore, we created a large number of virtual machines and observed the removal of blocks. It was found that a large number of DHCP blocks were not removed, causing the creation of virtual machines to time out. Then we observed the message queue serving the DHCP agent and found a large backlog of messages, so we judged that the processing power of the DHCP agent was insufficient. We first analyzed the processing logic of the DHCP agent and found some problems in its handling. For example, the DHCP agent processes resources under the same network, such as the network, its subnets, and its ports, mutually exclusively, that is, serially. In addition, the DHCP agent usually takes one to two seconds to process one port message, and when a large number of virtual machines are created, the number of port messages generated is at least three times the number of virtual machines.
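The two-block gate described above can be sketched as a minimal Python model. The class and method names here are illustrative, not Neutron's actual code; the point is that the port only goes ACTIVE once both agents have reported.

```python
# Minimal sketch of Neutron's provisioning-block gate: a port is reported
# ACTIVE only after every registered block has been removed.

DHCP = "DHCP"        # removed when the DHCP agent reports the port ready
L2_AGENT = "L2"      # removed when the OVS agent reports the port up

class Port:
    def __init__(self, port_id):
        self.id = port_id
        self.blocks = {DHCP, L2_AGENT}  # both entries written to the DB on creation
        self.status = "DOWN"

    def remove_block(self, entity):
        """Called when an agent reports completion for this port."""
        self.blocks.discard(entity)
        if not self.blocks:             # only when BOTH blocks are gone...
            self.status = "ACTIVE"      # ...does the server mark the port ACTIVE
            self.notify_nova()

    def notify_nova(self):
        # Here Neutron would send the network-vif-plugged event to nova-compute.
        print(f"network-vif-plugged for {self.id}")

port = Port("port-1")
port.remove_block(L2_AGENT)   # OVS agent reports first; port stays DOWN
assert port.status == "DOWN"
port.remove_block(DHCP)       # DHCP agent reports; now the port goes ACTIVE
assert port.status == "ACTIVE"
```

If the DHCP agent is backlogged, the first `remove_block` call never arrives in time, which is exactly the failure mode observed in the test.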
When the DHCP agent's processing speed is lower than the generation speed, a message backlog builds up, and virtual machine creation times out while waiting for the port to become active. After discovering this problem, we first checked whether the community had encountered the same problem. Our product is based on the OpenStack Rocky version. We found that the latest code in the community had optimized the DHCP agent, including prioritization of DHCP messages and prioritization of port-creation processing. This reduces the timeouts when creating virtual machines in some cases, but when a large number of virtual machines are created, a large number of creation messages are generated, so the community's optimization is of limited help. The community also refined the network resource locking mechanism so that resources of the same network are processed through a serialized queue. This optimization prevents a large number of messages on one network from affecting the processing of other networks' messages. However, it basically does not help when a large number of virtual machines are created on the same network; it is no fundamental solution to processing a flood of messages from a single network. In response to this situation, we proposed our own optimization plan based on our analysis. The time-consuming part is ultimately the handling of the dnsmasq process, and the final action for any message is nothing more than restarting, disabling, or enabling dnsmasq. Therefore, a message-merging mechanism is adopted: if there are a lot of messages from the same network, they are processed as one batch, and dnsmasq is reloaded only once. Concretely, the messages sent by the Neutron server to the DHCP agent are no longer processed immediately; instead, the messages are cached.
Then a coroutine pool is started to fetch messages from the cache, taking all the messages of the same network at once; the final effect of the whole batch of messages is just that the dnsmasq process is reloaded once. The coroutines in the pool adopt an event-notification mechanism: when there is no message to process, the coroutine pool waits on the event notification instead of spinning or sleeping, and when an event notification arrives it resumes immediately. This guarantees that when there are a large number of messages, the coroutines keep processing without any waiting, and when there are no messages, they do not run idle loops. We then conducted a comparative test of the original community code, the new community code, and our optimized code. We did not adopt the method of actually creating virtual machines, because the rate of virtual machine creation is throttled by the various components of Nova, so the message concurrency arriving at the DHCP agent would be an uncertain value. What we did instead was directly simulate the messages sent by the Neutron server to the DHCP agent, sending the messages directly to the message queue through a script, and letting the DHCP agent consume and process them. We first tested the Rocky version of the code, sending messages from the same network at two messages per second. At this order of magnitude, the community's Rocky version code could not keep up, and very soon there was a message backlog in the message queue. Then we tested the latest DHCP agent code, sending messages from the same network at two messages per second for a total of 10 minutes. At this order of magnitude, the new community code could keep up, but with a delay of about 6 minutes: all the messages were processed only 6 minutes after sending stopped.
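The message-merging idea can be sketched as follows, using threads and a `threading.Event` in place of eventlet green threads; all names are illustrative, not the actual agent code.

```python
# Sketch of message merging: cache incoming messages per network, wake workers
# via an event (no busy-wait), and drain one network's whole batch at a time,
# so a burst of N port messages costs only one dnsmasq reload.
import threading
from collections import defaultdict

class CoalescingWorker:
    def __init__(self):
        self.pending = defaultdict(list)   # network_id -> cached messages
        self.event = threading.Event()     # workers sleep on this, not busy-wait
        self.lock = threading.Lock()
        self.reloads = 0                   # dnsmasq reloads actually performed

    def enqueue(self, network_id, msg):
        with self.lock:
            self.pending[network_id].append(msg)
        self.event.set()                   # wake a worker immediately

    def run_once(self):
        self.event.wait()                  # blocks with no CPU cost while idle
        with self.lock:
            if not self.pending:
                self.event.clear()
                return None
            network_id, msgs = self.pending.popitem()  # ALL msgs of one network
            if not self.pending:
                self.event.clear()
        self.reloads += 1                  # one dnsmasq reload for the whole batch
        return network_id, len(msgs)

w = CoalescingWorker()
for i in range(100):                       # 100 port messages on one network...
    w.enqueue("net-1", f"port-{i}")
print(w.run_once())                        # ...drained as one batch: ('net-1', 100)
assert w.reloads == 1
```

The key property is that the reload count depends on the number of busy networks, not the number of messages, which is why the same-network flood stops being the worst case.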
Then we tested the optimized code, still sending messages from the same network at 10 messages per second, with the load applied for 10 minutes. At this order of magnitude it coped easily; the delay was about half a minute, that is, after the 10-minute run stopped, the remaining messages were processed in about 30 seconds. We then tested 20 messages per second, and it coped easily with a delay of about one minute. We increased to 30 messages per second, and the delay was about two minutes. We also tested the multi-network situation, generating 30 messages per second across 5 networks; it could also cope easily with 10 minutes of load, and the delay was about two minutes. To ensure that there really was no problem under this pressure, during the 10 minutes of load we created a virtual machine on the same network every 20 or 30 seconds; the network card came up in time and the virtual machines were created successfully. After testing, the optimization effect was verified: the processing capacity of the optimized DHCP agent exceeds the processing capacity of the current Neutron server, so as long as the Neutron server can handle the load, the DHCP agent can also handle it in time. However, when we repeatedly deleted and created virtual machines in batches, there was still the problem of network card wait timeouts. This made us very stressed: could it be that there was a problem with the previous tests, and the problem had not been improved at all? With this suspicion, we went to create a large number of virtual machines and observed the provisioning blocks of the newly created ports, and found that the DHCP block was quickly removed, but the OVS agent block was not removed.
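As a back-of-the-envelope sanity check on these numbers, the sustained throughput implied by each run can be computed as total messages divided by the send window plus the drain delay (assuming steady processing; the function name is just for illustration):

```python
# Implied sustained processing rate for each pressure-test run:
# total messages sent / (send window + time to drain the backlog).
def sustained_rate(msgs_per_sec, send_secs, drain_secs):
    total = msgs_per_sec * send_secs
    return total / (send_secs + drain_secs)

print(round(sustained_rate(10, 600, 30), 1))   # 10 msg/s, 30 s drain  -> 9.5 msg/s
print(round(sustained_rate(20, 600, 60), 1))   # 20 msg/s, 1 min drain -> 18.2 msg/s
print(round(sustained_rate(30, 600, 120), 1))  # 30 msg/s, 2 min drain -> 25.0 msg/s
```

Even the heaviest run sustains roughly 25 messages per second, versus the one-to-two seconds per message of the unoptimized agent.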
It stands to reason that when creating virtual machines, nova-compute limits the number of virtual machines that can be concurrently created on a compute node, so the concurrency the OVS agent has to handle is not large, and the OVS agent performs port discovery and reporting concurrently. So there should be no problem. We conducted a lot of log analysis and code analysis, and found that the OVS agent also has a bottleneck, but not at creation time: it appears when virtual machines are deleted. In such scenarios, after a large number of virtual machines are deleted, the OVS agent is busy deleting connection tracking (conntrack) entries; it is then slow to discover new network cards, and it is very easy for the creation of a virtual machine to time out. The fundamental reason is that the OVS agent is a single-process program. Concurrency within the program is realized by coroutines, and a coroutine cannot be preempted; it must yield the CPU voluntarily. The current OVS agent uses a coroutine to loop and process port changes, and when connection tracking needs to be deleted, it starts one delete process for each conntrack entry to be deleted, so the coroutines that delete connection tracking frequently occupy the CPU. The coroutine that processes a discovered network card has a relatively long processing flow and yields the CPU often, while the coroutines that delete connection tracking are numerous, so the overall share of CPU time allocated to processing a discovered network card's port is relatively small. This causes the discovery of Neutron ports to be slower, and timeouts occur very easily. In response to this, we made a plan hoping to start the processes that delete connection tracking as rarely as possible, even if the deletion process runs for a long time, so that the CPU is left to the coroutine that discovers network cards and processes Neutron ports.
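A toy simulation makes the starvation effect concrete. This is purely illustrative (real eventlet scheduling is more complex): with about 100 runnable coroutines that all keep yielding, the single port-processing coroutine gets only about 1% of the scheduler's turns.

```python
# Toy round-robin model of cooperative scheduling: one port-processing
# coroutine sharing the scheduler with N conntrack-delete coroutines.
import itertools

def port_cpu_share(n_delete_coroutines, total_slices=1000):
    tasks = ["port"] + ["delete"] * n_delete_coroutines
    port_slices = sum(
        1 for t in itertools.islice(itertools.cycle(tasks), total_slices)
        if t == "port")
    return port_slices / total_slices

print(port_cpu_share(99))   # 100 runnable coroutines -> port gets 0.01 of the CPU
```

The fix described next attacks exactly this ratio: fewer competing delete tasks means a larger share for port discovery.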
So our optimization plan is to define a new process, neutron-conntrack-delete. The coroutine in the OVS agent does not directly call conntrack; instead it calls neutron-conntrack-delete to delete the connection tracking entries. The feature of the neutron-conntrack-delete process is that it can accept multiple source IPs and destination IPs at once. In this way, where previously many conntrack processes needed to be started to delete the connection tracking, now only one process needs to be started to do the deletion. Someone may ask: is the deletion slower than before? It used to be that multiple coroutines deleted concurrently, and now one process does the deleting. Let us set aside for a moment whether the deletion is fast or not: we first solved the problem that the OVS agent always occupies the CPU while deleting connection tracking, because where it originally might need to start 100 processes, it now only needs to start one, and switching processes back and forth is itself very CPU-intensive. This ensures that the coroutine that discovers network cards and processes Neutron ports is no longer starved of CPU. Besides, the new neutron-conntrack-delete is not slow at deleting connection tracking. First, inside the new process we can also use coroutines to delete concurrently, because the new process does not have to worry about taking up too much CPU time. Second, since conntrack supports CIDR filters, we can aggregate the IPs into CIDR blocks, which fundamentally reduces the number of delete operations. Also, for the sake of safety, a process normally closes inherited file descriptors when it starts a child process, which is also time-consuming; the new neutron-conntrack-delete does not need to consider closing file descriptors when it starts a child process, since the child process is short-lived and exits on its own. All of this makes the deletion faster than before, and it no longer affects the discovery of network cards.
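The CIDR-aggregation idea can be sketched with Python's standard ipaddress module. This only constructs command lines and executes nothing; that conntrack accepts address/mask filters is taken from the talk, and the helper name is illustrative.

```python
# Sketch: aggregate the deleted VMs' per-port /32 addresses into CIDR blocks,
# so one conntrack invocation can cover many entries at once.
import ipaddress

def build_delete_commands(ips):
    nets = [ipaddress.ip_network(ip) for ip in ips]   # each IP becomes a /32
    merged = ipaddress.collapse_addresses(nets)       # adjacent /32s -> CIDRs
    return [["conntrack", "-D", "-s", str(net)] for net in merged]

# 64 consecutive addresses, e.g. one deleted batch of VMs on a subnet:
ips = [f"10.0.0.{i}" for i in range(64)]
cmds = build_delete_commands(ips)
print(cmds[0])   # ['conntrack', '-D', '-s', '10.0.0.0/26'] -- one command, not 64
```

When the deleted addresses are contiguous, as in batch deletion, the number of delete operations collapses dramatically; scattered addresses simply fall back to one command each.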
Let's look at the effect. Before optimization, when deleting 60 virtual machines on one compute node, it took more than 60 minutes to delete the connection tracking, long enough to make concurrent virtual machine creation time out. After optimization, the connection tracking can be deleted in less than five minutes. In other words, with nova-compute's default configuration of waiting 300 seconds for the network card, new virtual machines can be created immediately after deleting a large number of virtual machines, unaffected by the deletion, and they are created successfully. That is a 12-fold improvement. The above is all of my sharing this time. Thank you.