Now I will present our work on deploying OpenStack on the Tianhe-2 supercomputer. First I will introduce how we deploy OpenStack on an HPC supercomputer and its deployment architecture, then how we optimize OpenStack at large scale, give some performance evaluation, and describe our contributions to the open source community.

First, the Tianhe-2 supercomputer. It is sponsored by China's central government, the Guangdong provincial government, and the Guangzhou city government, and it was built by our National University of Defense Technology. In June this year it took the number one spot on the TOP500 list of supercomputers; its peak performance is 54.9 petaflops and its LINPACK performance is 33.9 petaflops. Tianhe-2 has a neo-heterogeneous architecture: it uses Intel Xeon CPUs together with Xeon Phi accelerators, so everything runs the same x86 ISA. It has 16,000 nodes for HPC, and it is installed at the National Supercomputer Center in Guangzhou.

Traditionally we think of a supercomputing center as a very large, very fast computer for high performance computing — an open platform for research and teaching. But beyond that, we want this supercomputer to be a public information infrastructure. So we deployed OpenStack on this supercomputer as an IaaS infrastructure: we virtualized it to provide a hybrid cloud and IaaS cloud computing, to meet the requirements of different environments, different applications, and different scales.

Now the hardware for the OpenStack deployment. As I said, Tianhe-2 has 16,000 nodes, and we use 6,400 of them as the OpenStack cloud infrastructure: 50 cabinets with 128 nodes in each cabinet. Each node has two 12-core Intel Xeon CPUs, 96 GB of RAM, one local 1 TB disk, and two Gigabit Ethernet ports.
Moreover, there is a custom internal network we call TH-NI; its bidirectional bandwidth is 160 Gb/s. For OpenStack we use Ubuntu Server as the infrastructure operating system; on this computer we deployed the OpenStack Grizzly release, with Ceph as the backend storage and Puppet as our deployment tool.

Now the deployment architecture of our system. Currently we run 400 nodes as the OpenStack infrastructure, of which two cabinets — 256 nodes — are controller nodes. We use 100 API nodes, each running nova-api, nova-scheduler, the Glance services, and the Cinder services; 4 nodes for LVS and Keepalived, one as the load-balancer master and three as its backups; 96 network nodes; and 32 nodes for Ganglia and Nagios. With this deployment architecture we can support 4,000 compute nodes, and we divide the compute nodes into 13 cells, because cells let Nova scale to a large number of nodes; each cell spans a few cabinets and gets two cell controller nodes.

How about the network topology? We have three networks. One Gigabit Ethernet network is for management; it is stable, and we think its performance is adequate for management traffic. Another Gigabit Ethernet network is for Ceph storage. While deploying OpenStack we also tried our TH-NI network: TH-NI is our RDMA interconnect, and by virtualizing Ethernet on top of it we get much better bandwidth than Gigabit Ethernet, so using TH-NI gives much better storage and IO performance. Finally, we use the interconnect for virtual machine communication — every virtual machine talks to the others over this network.

As for storage, OpenStack by default uses Swift, but we use Ceph as an all-in-one storage backend. That means we use Ceph RBD for the image storage.
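Dividing the compute nodes into cells is done with nova-cells, which first shipped in Grizzly. A minimal sketch of what that configuration looked like in Grizzly-era releases — the cell names and the split into an API cell plus child cells are illustrative assumptions, not our exact settings:

```ini
# nova.conf on the top-level (API) cell -- illustrative sketch
[DEFAULT]
# Route instance builds through the cells scheduler
compute_api_class = nova.compute.cells_api.ComputeCellsAPI

[cells]
enable = true
name = api

# nova.conf on one child (compute) cell, which runs its own
# nova-cells, scheduler, and message queue for its ~100-300 nodes
[cells]
enable = true
name = cell01
```

Each child cell has its own database and RabbitMQ, which is what keeps the per-cell API communication cost bounded as the node count grows.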
We use RBD as the volume backend, so every volume is backed by Ceph and can be mounted by a virtual machine, and we use Ceph's POSIX file system, CephFS, for the instance storage. With CephFS we can construct shared storage across the compute nodes, so we can support live migration of virtual machines between nodes.

How do we deploy the software? We use Puppet, and we define different manifests for different nodes, because an OpenStack deployment is a graph of different node roles; we use different Puppet manifests for each role, and every node within a role follows the same configuration. For 4,000 or 6,000 nodes we use a diskless system to network-boot each node, which keeps the management cost low at such a large scale, with Ubuntu Server as the base operating system. We use the RDMA TH-NI NIC to push the image and accelerate booting. On first boot we partition the hard disk and set up some defaults such as SSH access and passwords. Because it is a diskless system, every node gets the same image; to assign different IPs, hostnames, and roles we use dynamic configuration at boot.

After we deployed OpenStack on thousands of nodes, though, its performance was very poor — when we started virtual machines, most of them failed. So how can we optimize OpenStack's performance and increase its scalability to thousands of nodes? We use more nodes, more worker processes, and a series of tunings for better performance. With more nodes: clearly, when one API server is not good enough, we can add more servers to support a larger scale, and we load-balance across all the API servers; we keep state out of the API servers themselves, so we can keep adding API server nodes for larger scale. The more important optimization is getting more work out of each server with more workers.
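The Ceph all-in-one layout described above — RBD for images and volumes, CephFS for instances — maps onto Grizzly-era configuration roughly as follows; pool names, user names, and the mount point are illustrative assumptions, not our exact settings:

```ini
# glance-api.conf -- store images in Ceph RBD
default_store = rbd
rbd_store_user = glance
rbd_store_pool = images

# cinder.conf -- back volumes with Ceph RBD
volume_driver = cinder.volume.drivers.rbd.RBDDriver
rbd_pool = volumes
rbd_user = cinder
glance_api_version = 2

# On every compute node, the instance directory is a CephFS mount,
# so it is shared storage and live migration between hosts works:
#   mount -t ceph mon1:6789:/ /var/lib/nova/instances
```

The key point is the last one: because /var/lib/nova/instances is the same POSIX file system on every compute node, a migrating VM's disk never has to be copied between hosts.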
First, multi-process workers for the Neutron server: we increased the API workers. The default value is one — just one API worker — and we increased it to 24. As I said, each node has two Intel CPUs with 24 cores in total, so we set 24 API workers in the Neutron server. Moreover, for Nova's APIs we increased the OSAPI compute workers and the metadata workers, and for the EC2 API we increased the EC2 worker count. nova-conductor supports the Nova compute nodes and handles their communication with the backend database; we set its worker number to 24 as well. At the same time we increased the Glance API's workers. That is the multi-process side. Keystone, though, does not support multi-threading or multi-process, so we use Apache hosting: Keystone is a stateless service, and we run it under Apache with WSGI.

How else can we increase the API servers' performance and capacity? For Nova, first we eliminated the API rate limit. Then we use a large DB pool size: we enable the SQLAlchemy connection pool and set the minimum and maximum pool sizes — the minimum is 30 and the maximum is 120. When we ran performance tests, QueuePool kept giving timeout errors; its limit defaults to 5, and with so many nodes accessing the Nova server we increased this number to 60. For Neutron, likewise, we use a large database pool and set this number to 60. In the meantime we also increased the agent down time: when we tested, most virtual machines could not get an IP over DHCP, and we found the agent_down_time default of 5 was too low, so we increased it from 5 to 30; with that, far more virtual machines get their dynamic IPs.

Next, RabbitMQ. For RabbitMQ we set a higher memory watermark.
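Collected in one place, the worker and connection-pool settings just described look roughly like this. Exact option names varied between Grizzly and Havana, so treat this as an illustrative sketch rather than our exact configuration files:

```ini
# nova.conf -- API workers sized to the 24 cores per node
osapi_compute_workers = 24
metadata_workers = 24
ec2_workers = 24
# SQLAlchemy connection pool: min 30, max 120, overflow raised
# from the QueuePool default of 5 to 60
sql_min_pool_size = 30
sql_max_pool_size = 120
sql_max_overflow = 60

[conductor]
workers = 24

# neutron.conf
api_workers = 24
# Raised from 5 so DHCP agents are not marked dead under load
agent_down_time = 30

# glance-api.conf
workers = 24
```

The pattern throughout is the same: one worker process per core, and database pools sized so that thousands of compute nodes polling the APIs do not exhaust the connection pool.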
The default high memory watermark is 0.4, and we increased it to 0.6, which gives higher throughput. Similarly we raised the socket limit: the default is about 1,000, and we raised it to 100,000. I think RabbitMQ is the main bottleneck in front of the API servers, and we are also clustering the RabbitMQ servers across one more level to distribute the workload.

For the backend database we use a large maximum connection count, raised to 100,000, and we use Galera to construct a multi-master cluster. With the help of Galera we can have several servers act as MySQL masters and use LVS to provide load balancing across them.

Currently we use KVM as the hypervisor on our supercomputer. For KVM we enable huge pages; we use vhost-net, which gives better network performance; and we use virtio for block devices, which gives better performance for storage IO. Similarly, we use KSM, kernel same-page merging: it deduplicates identical pages of memory copy-on-write between guests, and that gains us more performance. The default IO scheduler is CFQ, and at the largest scales, when IO became the bottleneck, we turned the scheduler to the deadline strategy.

How about service high availability and load balancing? For all API servers we use LVS and Keepalived. We also increased the RPC response timeout and the quantum URL timeout; both settings increase the success rate when we boot a large number of instances simultaneously. And because we want to provide an IaaS infrastructure service, high availability is important for customers. As everyone knows, KVM is our hypervisor, and we use Ceph underneath, just as described: Ceph RBD for images and volumes, and CephFS, the POSIX file system, for our instance storage.
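On the hypervisor side, the vhost-net and virtio choices show up in the libvirt domain XML that Nova generates for each guest. A fragment of what that looks like — the file path and bridge name here are illustrative, not taken from our deployment:

```xml
<devices>
  <!-- virtio block device: paravirtualized disk IO -->
  <disk type='file' device='disk'>
    <driver name='qemu' type='raw' cache='none'/>
    <source file='/var/lib/nova/instances/instance-00000001/disk'/>
    <target dev='vda' bus='virtio'/>
  </disk>
  <!-- virtio NIC: with KVM, the in-kernel vhost-net backend handles
       this queue, avoiding userspace copies per packet -->
  <interface type='bridge'>
    <source bridge='br100'/>
    <model type='virtio'/>
  </interface>
</devices>
```

Both devices are paravirtualized: the guest cooperates with the hypervisor instead of having real hardware emulated, which is where the network and storage IO gains come from.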
So, with the help of this shared storage and Ceph, we can support each virtual machine's high availability through live migration.

How about Ceph performance? In Ceph, when a client reads or writes some data it first consults the MDS for the metadata and then gets the data from the OSDs. To support small files well, we contributed an inline data optimization: we store small files' data on the MDS itself, so a client can use a single request to get both the metadata and the real data from the MDS.

The second Ceph optimization we implemented is punch hole support. What does punch hole support mean? When a virtual machine image is booted, the guest reads, writes, and stores its data on the image; when the guest deletes some files, that leaves holes in the image file. We enabled punch hole support so that when a user deletes data or files from the virtual disk, we recognize that the punched range can be recycled and made available to other applications or other users.

Now some performance tests. As I said, we currently deploy OpenStack on 4,000 nodes, with 200 nodes as API servers and 4,000 nodes as compute nodes. We tested the maximum number of virtual machines started simultaneously: using a small test image with the tiny flavor, we booted 5,300 images at the same time, and every one came up successfully, went active, and got a correct IP. That is our performance test result.

We also tested listing instances, which is a frequent OpenStack operation: with 1,000 active instances we measured how many concurrent list-instance requests the deployment can handle per second, and we reached about 1,300 successful operations per second with our deployment architecture.
During these tests we also found the performance bottlenecks. First, the management network is Gigabit Ethernet, and we think its bandwidth is the bottleneck; if we use TH-NI for management instead, we get more performance. After we moved to TH-NI, the Nova API became the new bottleneck, which led us to use more API servers.

We built our system on open-source OpenStack, so we made bug fixes and contributed them back to the open source community. For Havana we committed 5 blueprints, 10 bug fixes, and about 4,000 lines of code; among independent developers we ranked number one for blueprints and number six for lines of code.

One blueprint we contributed is per-project-user quota support. As you know, in OpenStack up to the Grizzly release, quotas can only be set per project. We offer whole projects on our hybrid cloud infrastructure, and we wanted to give the cloud project users finer-grained quota control and management, so we added user-level quotas inside the project — two levels of quota. We also committed editable default quota support.

Another point we committed is accessing virtual machines via port mapping. As you know, there are two existing methods to access a virtual machine from outside: one is floating IPs, but that is limited by the number of available public IPs; the other is VPN, but we think that is relatively complex to configure.
So we map unused ports of a public IP dynamically to different virtual machines, to support access from outside the cloud to the nodes inside, and we designed a UI for it for some customers. For Ceph, we committed the two points I mentioned — inline data support for small files and punch hole support — around 20 commits. Our team also implemented multi-level user management with domains, multi-level user quota management for Nova, and various fixes.

Currently we have built two use cases on Tianhe-2, both for the government of Guangzhou city. The first one is the website cluster of Guangzhou city's government: web servers, application servers, and database servers, which we optimize with load balancing and high availability to support the government's website cluster. The other case is its data management system: we use Ceph RBD as the backend storage for all its servers, and we optimized the virtual machines' storage drivers for it.

How about our team's future work? For the supercomputing side, we want more performance and stability testing, and we want to automate operations for the large-scale system — how to operate thousands of nodes — and we will upgrade from Grizzly to Havana this month. For the Guangzhou Supercomputer Center, moreover, we want to provide XaaS services, such as database services, on top of the IaaS. For the open source community, we will focus on automatic operation tools for large-scale systems: automatic, remote deployment, monitoring, and billing. We have also just developed our elastic resource management code to provide QoS-based virtual machine cluster scaling that scales dynamically. And for Ceph, we optimized the tiering strategy and added SSD-based writeback cache support.

This is our team. Thank you.

[Q&A] With 128 nodes per cabinet, 40 cabinets give us roughly 4,000 usable nodes, and the 6,400 nodes occupy 50 cabinets.

[Q&A] How did we decide on roughly one hundred nodes as a cell? There are a couple of reasons. The first is VLANs: when we construct a virtual machine cluster we rely on VLANs, and currently VLANs work well with hundreds of nodes but cannot support thousands of nodes, so we set the cell scale at around a hundred nodes. The other reason is performance: as I said, we set two cell controller nodes for each cell, and if we put too many nodes in one cell, the Nova API communication cost becomes very high and the two controller nodes cannot keep up — we tested that two controllers cannot support too many nodes in one cell for the Nova API.