Hello everyone. I'm a cloud computing engineer from China Mobile, and my name is Li Hao. I have been engaged in OpenStack development and maintenance for more than two years. Today I would like to talk about China Mobile's cloud computing, and about the architecture and optimization of a 1,000-node OpenStack cluster at China Mobile.

I will cover the topic from four aspects. First, the practice of OpenStack in China Mobile: I will introduce China Mobile's plans for the public cloud and the private cloud, and the progress of building our OpenStack resource pools. Second, the architecture and deployment of China Mobile's public cloud: I will show you the topology diagram, a lot of details of that topology, and the tools we use to deploy OpenStack. Third, I would like to share what we have done and optimized in the process of deploying a 1,000-node OpenStack cluster; I will go through almost all of the core OpenStack components and introduce the optimization of each one. At last, I will introduce the joint test of our public cloud, which was carried out by China Mobile and Intel. The main purpose of the test is to analyze which stage is the most time-consuming in the process of creating a virtual machine, and then, after we optimized the bottlenecks of the system, or in other words the most time-consuming steps, to what extent the performance of the whole system improved.

As I mentioned earlier, China Mobile has spent a lot of human and financial resources on the construction of cloud computing. Now we have one public cloud with two resource pools and one private cloud with two resource pools. The public cloud pools are located in Guangzhou, in the south of China, and in Beijing, the capital of China. Both pools have about 1,000 nodes. The Guangzhou pool has 600 Nova compute nodes and 13 controller nodes; the remaining nodes include 115 object storage nodes and 115 block storage nodes, and we also have 17 NFV/SDN nodes. The numbers of nodes deployed in Beijing and Guangzhou are roughly the same. Our public cloud is online now; you can visit the website ecloud.10086.cn. Our private cloud has two resource pools, and each pool will have 3,000 nodes. The two resource pools are located in Harbin, in the northeast of China, and in Hohhot, in the north of China. We hope that in the middle of 2017 our private cloud can go online. By the way, in the Harbin resource pool the 3,000 nodes are all bare-metal nodes, and we will use Ironic to manage them. Perhaps by then it will be the largest Ironic cluster.

Public cloud topology. Now I will introduce the architecture and deployment of China Mobile's public cloud. Let us look at the diagram. First, we have three network-isolated zones: the DMZ zone, the core zone, and the production zone. On the left is the DMZ zone, which can be visited from the internet; we deploy nova-novncproxy and HAProxy with Keepalived in this zone. In the middle of the picture is the core zone, where we deploy the controller services, for example nova-api, cinder-api, and Keystone. On the right is the production zone: about 600 compute nodes, 115 block storage nodes, and 115 object storage nodes are in this zone, and glance-api and cinder-volume are also in the production zone. Second, every physical machine in our public cloud has six NICs, and every two NICs are bonded, so we have three bonds: a management bond, a business bond, and a storage bond. Traffic inside the core zone goes over the business network, and the traffic between the core zone and the DMZ zone goes over the MGMT, or management, network.
Traffic between the core zone and the production zone can only pass through the management network. The traffic between VMs, and between VMs and the internet, goes over the business network, and the traffic from VMs to block storage and object storage goes over the storage network. Third, cinder-volume uses Sheepdog and IP-SAN as backends; Sheepdog and the IP-SAN are deployed in the production zone, so cinder-volume must also be in the production zone. glance-api uses Swift as its backend, and Swift is also deployed in the production zone, so glance-api is in the production zone as well. Last, every OpenStack service is deployed on three to five physical machines, with HAProxy and Keepalived doing the load balancing. We deploy two MySQL Galera clusters: one serves as the index database for Gnocchi, and one serves the other components.

In this slide, I will introduce some details of the topology. Horizontally expanded services, that is, horizontal extension of the core services: it means, for example, that we deployed six Keystone nodes because we think three Keystones are not enough for us, and we can deploy more. We use HAProxy, Keepalived, and LVS to do the load balancing, so a service such as Keystone is accessed through the VIP created by Keepalived instead of through one individual Keystone node's IP, and there is no single point of failure. We use five or more nodes to deploy the API services, and each service resides on different racks: a physical machine hosts multiple services, and each service's instances are spread across different racks. I think this ensures that the system is highly robust. We also use availability zones and host aggregates in our public cloud. The compute nodes are divided into many AZs, for example an NFV AZ, a big-data AZ, and the default AZ. We also use two host aggregates, one for Windows and the other for Linux, because we think Windows VMs and Linux VMs should not be on the same machine; the Windows license may be charged according to the number of physical machines that are running Windows VMs.

In our public cloud we have two pools. Every pool is independent: each has its own OpenStack components such as Keystone, Cinder, Neutron, Glance, Nova, Ceilometer, and RabbitMQ. We don't use multi-region, because our experience is that about 500 compute nodes can be in a single region. The two independent OpenStack pools are controlled by the China Mobile resource management platform, which we call OP. End users' information is synchronized across the OpenStack pools: when a user registers a new account through OP, OP calls the Keystone REST API of every OpenStack pool, which ensures the user is registered in both pools with the same name, password, and other information. But other resources are not synchronized; for example, a user's VMs, virtual routers, and key pairs are independent, and most resources can't migrate between the two pools except images.

Now let me talk about the deployment tools and how we deploy multiple versions of OpenStack components. First, we use Ansible, rather than Puppet or SaltStack, because we think Ansible is very easy; the most useful modules for us are shell, copy, and script, together with playbooks. Our deployment is not fully automatic but semi-automatic: we write some scripts, like the ones in the pictures, to accomplish the deployment. We don't use Packstack or DevStack. I think Packstack can only deploy a single controller and a single network node, and the controller is not highly available, so it is not suitable for a big public cloud; DevStack is more flexible and fits developers, but it is also not a fit for a big public cloud.
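Just to give a flavor of this semi-automatic Ansible style, here is a minimal playbook sketch using the copy, script, and service modules. It is only an illustration I put together for this write-up, not China Mobile's real deployment code; the inventory group, file paths, and service name are all made up.

```yaml
# deploy_service.yml -- illustrative sketch only, not the real playbook
- hosts: controllers              # hypothetical inventory group
  become: true
  tasks:
    - name: Push a pre-rendered service configuration file
      copy:
        src: files/nova.conf      # assumed local file
        dest: /etc/nova/nova.conf
        owner: nova
        group: nova
        mode: "0640"

    - name: Run a site-specific setup script on the target node
      script: scripts/post_install.sh   # assumed helper script

    - name: Make sure the service picks up the new configuration
      service:
        name: openstack-nova-api
        state: restarted
```

A playbook like this would typically be run per service and per rack, which matches the semi-automatic, script-driven workflow described above.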
We had a problem in deploying the public cloud: Senlin and Gnocchi have no Kilo version, so we have to use a higher version for these two components while the other components still use Kilo. We tried to resolve the dependency conflicts between the different OpenStack versions, but we failed. In the end, we think that deploying the higher-version OpenStack components in Docker is a good idea.

I will talk about the optimization in this slide and the following slides. We have three levels of optimization: infrastructure optimization, middleware component optimization, and OpenStack component optimization. First, every physical machine has three bonds in active-backup mode. The business network and the storage network are separated to ensure that the business traffic and the storage traffic are independent of each other, and the core services use SSDs; for example, InfluxDB runs on SSDs because we think this makes data reads and writes faster than before. The traffic between the core zone and the production zone goes through the management network, and the traffic inside the core zone goes through the business network, which makes full use of the network resources. I will talk about the optimization of HAProxy, Keepalived, MariaDB Galera, and RabbitMQ in the next slide, the optimization of the middleware components.

For RabbitMQ, we increase the RabbitMQ connection pool, add a heartbeat check, add an automatic healing function, and raise the open-files limit for RabbitMQ. For HAProxy, we configure a reasonable number of HAProxy processes and strengthen the health checks of the backend services. For Keepalived, we enhance the exception handling for Keepalived and add exception handling for the NICs: if a NIC used by Keepalived goes down, the script tries to restart the NIC. For MySQL, we increase the maximum number of connections, because the default value is too small, and we solve the MySQL deadlock problem. We all know that MySQL Galera mode is active-active. When nova-api receives more than one request, MySQL A serves request 1 and MySQL B serves request 2; request 1 locks table A and wants to write table B, while request 2 locks table B and wants to write table A, so a deadlock may occur. We configure HAProxy to turn the MySQL Galera cluster from active-active into active-backup, so this situation does not occur anymore.

Nova optimization. Configuration file optimization: first, we increase the max pool size, the number of workers, and the scheduler host subset size to increase the concurrency Nova can handle, and we use reasonable overcommit ratios for CPU, disk, and memory to increase the number of VMs that can run on each physical machine. We also use more scheduler filters to achieve precise scheduling, and we use config drive instead of the metadata server because we think the metadata server is not robust. We developed some new Nova features, such as changing the VM's password, changing the VM's hostname, and executing commands inside the VM through the Nova API, and we optimized the code of volume attach and volume detach.
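To give a feel for what this kind of tuning looks like, here is an illustrative nova.conf excerpt. The option names are the standard Kilo-era ones, but every value and the filter list are just examples, not our production numbers.

```ini
# /etc/nova/nova.conf -- illustrative excerpt, values are examples only
[DEFAULT]
osapi_compute_workers = 16        # more API workers for higher concurrency
metadata_workers = 16
scheduler_host_subset_size = 30   # choose randomly among the top N hosts to reduce scheduler races
cpu_allocation_ratio = 8.0        # overcommit ratios for CPU / RAM / disk
ram_allocation_ratio = 1.5
disk_allocation_ratio = 1.0
force_config_drive = always       # "True" on newer releases; use config drive instead of the metadata server
# Example filter set for more precise scheduling (incl. aggregate-aware filtering)
scheduler_default_filters = RetryFilter,AvailabilityZoneFilter,RamFilter,DiskFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,AggregateInstanceExtraSpecsFilter

[database]
max_pool_size = 30                # larger SQLAlchemy connection pool
max_overflow = 60
```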
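And going back for a moment to the Galera deadlock fix from the middleware slide: the change is essentially an HAProxy backend in which only one Galera node receives traffic and the others are marked as backups. A minimal sketch, with made-up names and addresses, could look like this:

```
# haproxy.cfg -- illustrative sketch of an active-backup Galera backend
listen galera_cluster
    bind 10.0.0.100:3306
    mode tcp
    option tcpka
    balance source
    # Only db1 is active; db2/db3 take over only if db1 fails,
    # so concurrent writes never land on two Galera nodes at once.
    server db1 10.0.0.11:3306 check inter 2000 rise 2 fall 3
    server db2 10.0.0.12:3306 check inter 2000 rise 2 fall 3 backup
    server db3 10.0.0.13:3306 check inter 2000 rise 2 fall 3 backup
```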
For Nova we also have a practice optimization: all the images are cached on each compute node. We can, for example, pick one physical machine and create one VM of every kind of image on it, so that the path /var/lib/nova/instances/_base on that machine holds a cache of all the VM images. Then we copy that image cache from this physical machine to all the other Nova compute nodes, and after that a VM created on any of the compute nodes has no need to download its image, which greatly speeds up VM provisioning. This is our practical optimization.

Optimization of Neutron. Configuration file optimization: we increase the workers, we increase the database connections, and we accelerate the restart of the OVS agent. You can see the Rally test: we create networks, subnets, and routers at a concurrency of 1,000, and all the tests succeed. In this slide, we address the restart of the OVS agent, which is very slow. We were using ovs-ofctl and ovs-vsctl to distribute the flow tables and configure the OVS ports, and the ovsdb operations are time-consuming: on average 0.54 seconds to configure one flow entry, which at large scale becomes very slow. Let's look at the table: restarting the OVS agent on a Nova compute node takes 15 minutes, which is too long. So we changed the Neutron code for restarting the OVS agent. First, we stop distributing duplicate flow entries. Second, we change the restart code of the OVS agent from single-threaded to concurrent. Third, we do not delete the ports and the flow entries on restart. Fourth, a local controller is used, and the flow entries are delivered through the local Ryu controller. After the optimization, the recovery time is reduced to seven minutes.

Optimization of monitoring. First, we use Ceilometer with Gnocchi and InfluxDB instead of Ceilometer with MongoDB, because at large scale Ceilometer with MongoDB is very slow when querying samples or monitoring data. After switching to Gnocchi, a sample query responds within five seconds with 10,000 VMs. Our monitoring system also supports OP for billing, and it supports sending the data on to Elasticsearch. Now I will give a brief introduction to China Mobile's monitoring architecture. The Ceilometer agents send the monitoring data to the Ceilometer collectors through LVS, and the Ceilometer collectors transform the data and send it to Gnocchi for storage through HAProxy. Gnocchi receives the data, stores the metric index in MySQL Galera, and stores the measures in InfluxDB. This is the brief introduction to our China Mobile monitoring architecture.

Optimization of Cinder. Configuration file optimization: in our public cloud, our Cinder supports a Sheepdog cluster and an IP-SAN cluster as backends at the same time. We add heartbeat timeout and threshold parameters for bidirectional heartbeat checks. We also modify other parameters: we set a reasonable over-provisioning ratio for volume capacity, we expand osapi_max_limit to increase the number of entries that can be listed, and we configure a reasonable QoS to limit the IO of each volume. Second, the high-availability optimization: multiple cinder-volume services are configured with the same host value, and we use Pacemaker to manage the multi-cinder-volume cluster, so at any time only one cinder-volume is up and the others are down. When the active cinder-volume crashes, Pacemaker restarts it; if it can't be restarted after many tries, Pacemaker starts another cinder-volume. And cinder-api's high availability also uses HAProxy and Keepalived to do the load balancing.
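Again, just to make this concrete, a cinder.conf excerpt along these lines would express the points above. The option names are the standard ones, but every value is an example and the shared host name is made up.

```ini
# /etc/cinder/cinder.conf -- illustrative excerpt, values are examples only
[DEFAULT]
# All cinder-volume nodes report the same host name, so they appear as one
# service; Pacemaker makes sure only one of them is actually running.
host = cinder-volume-cluster01
# Allow API listings to return more entries than the default.
osapi_max_limit = 5000
# Thin-provisioning over-subscription ratio used in capacity calculations.
max_over_subscription_ratio = 2.0
```

The per-volume IO limit itself is usually expressed as Cinder QoS specs attached to a volume type (cinder qos-create / qos-associate) rather than as a cinder.conf option.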
Optimization of Glance. China Mobile has developed some new features for Glance. For example, images are shared across the different pools: if I upload an image to the Beijing pool, OP will synchronize the image to the Guangzhou pool, so the images in the two pools are all the same. We also developed a new function so that images can be downloaded with ordinary download tools, for example FlashGet and similar clients, because we think end users don't want to have to use the Glance client to download their images.

Next, the practical optimization. First, we configure glance-api to use the image cache, because we think the image cache can increase the download speed, and we configure nova-compute to access the glance-api endpoint address over the storage network. When a user calls the nova image-create command to back up a VM, nova-compute will first make a full backup of the VM's disk and then upload it to Glance. By default, nova-compute calls the VIP of Glance, but that costs a lot of time: the traffic starts from the nova-compute node in the production zone, goes to HAProxy in the core zone, then from HAProxy back to glance-api in the production zone, and at last glance-api sends the data to the backend. If nova-compute could upload the image directly to the Glance storage address, the time cost would be lower.

Optimization of Keystone. Configuration file optimization: we use more workers and we run Keystone under httpd, because we think httpd is better for us than the normal Keystone service, and we use the Fernet token format instead of the PKIZ format. Here is an example: we ran a comparative test of PKIZ and Fernet at 5,000 concurrency, where each request contains two steps, one to get a token and the other to validate the token. The left picture is the result for PKIZ and the right picture is the result for Fernet; we can see Fernet has fewer failures and a lower time cost.

Now we come to the last part, the 1,000-node performance. In this part I will introduce the joint test of our public cloud; the main purpose of the test is to determine which part is the bottleneck when booting a new VM in OpenStack. First, I want to introduce the method, or the technique, of the test. The tests are done in two phases. The first is the acquisition phase: we monkey-patch the live system, inserting logging statements into the most important functions, so that OpenStack produces logs with timestamps for analysis with minimal overhead. The second phase is offline analysis: we parse the collected logs, reconstruct the concurrent request-handling processes, and render rich statistics.

We can see the picture: it shows the elapsed-time stack acquired at different concurrencies of creating VMs. The X-axis shows the different concurrencies and the Y-axis shows the cost and the time-consumption ratio of the different components. We tested from 5 concurrent requests up to 2,000 concurrent requests. The scheduler costs the most time during instance provisioning, and it gets worse as the request pressure increases; we can see the scheduler is already saturated at eight concurrent requests. I think Rally is black-box provisioning: it needs constant polling for intermediate state updates, which is not accurate and powerful enough, while our test is white-box profiling, which gives deeper insight into the OpenStack system without excessive monitoring overhead. The analysis also does state-machine parsing: it can enumerate all the success and failure states of every tracked request and replay the concurrent provisioning process for later analysis.
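To illustrate the flavor of the acquisition phase described above, here is a small Python sketch of the timestamped entry/exit logging that monkey-patching can add. It is my own simplified illustration of the idea, not the actual tooling used in the joint test, and the patched target shown in the comment is only a hypothetical example.

```python
import functools
import logging
import time

LOG = logging.getLogger("profiling")

def timestamped(func):
    """Wrap a function so that entry, exit, and elapsed time are logged."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        LOG.info("ENTER %s at %.6f", func.__qualname__, start)
        try:
            return func(*args, **kwargs)
        finally:
            end = time.time()
            LOG.info("EXIT  %s at %.6f (%.3fs)", func.__qualname__, end, end - start)
    return wrapper

# Monkey-patch an "important function" in a running service. The target below
# is hypothetical; in practice you would patch the scheduler or API methods
# you care about, for example:
# import nova.scheduler.filter_scheduler as fs
# fs.FilterScheduler.select_destinations = timestamped(fs.FilterScheduler.select_destinations)
```

The offline phase then stitches these ENTER/EXIT records together per request to rebuild the state machine and the per-component time breakdown.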
When the concurrency is 2,000, the throughput is 1.178 requests per second, the failure rate is up to 41.4 percent, and the retry rate is up to 26.3 percent. These results show that we hit some problems when the concurrency is large. The right picture is the analysis of an individual request using the state machine: we record every step during instance provisioning, and if the request hits an error we record that as well. The left picture shows the failure reasons, for example the Neutron error when creating a port on a network.

OpenStack optimization: in fact, we have solved the three root issues, which are database deadlocks, Neutron port-creation failures, and Keystone authentication failures. After we solved these problems, together with the optimizations I mentioned before, we can see the performance has improved a lot: the failure rate dropped from 25.7 to 2.0, the retry rate dropped from 29.0 to 8.0, the wall-clock time went from 417 to 115 seconds, and the throughput rose from 1.31 to 4.35 requests per second. Thank you for watching, thank you for listening, and thank you very much.