OK, first of all, I will introduce myself. My name is Li Hao, and I'm a cloud computing engineer from China Mobile. This is my partner, Dr. Tan Qiming; he is the Senlin PTL. Two years ago I presented a topic on the architecture, deployment, and optimization of China Mobile's public cloud clusters at the 10,000 scale. This time I will share a new topic: managing clusters of thousands of VMs using Senlin.

We will cover the topic from six aspects. I will introduce OpenStack at China Mobile and China Mobile's requirements for large-scale clusters. Then Dr. Tan will introduce Senlin and tell us the important parts of its design. He will also talk about the challenges we met and the solutions we used to solve them. After that, Dr. Tan will show the test results, both before and after the optimization. At the last, he will talk about future work for Senlin.

Now let's talk about China Mobile's public cloud and private cloud. In the construction of cloud computing, China Mobile has spent a lot of human and financial resources. Now we have one public cloud with two resource pools and one private cloud with two resource pools. The public cloud is located at Guangzhou, in the south of China, and at Beijing, the capital of China. Both pools have about 1,000 nodes. The Guangzhou pool has 600 Nova compute nodes and 13 controller nodes, including three Senlin nodes, which have the Senlin API and Senlin engine deployed on them. Besides those, it contains 115 object storage nodes and 115 block storage nodes, and we also have 17 SDN and NFV nodes. The numbers, the deployment, and the architecture of the Guangzhou pool and the Beijing pool are the same. Our public cloud is online now, with the power of Senlin.

Our private cloud has two resource pools, each with 3,000 nodes. The two resource pools are located in Harbin, in the northeast of China, and Hohhot, in the north of China. We hope that in the middle of 2017 our private cloud can be online, and there is no doubt that we will deploy the Senlin service for our private cloud too.

Next, why does China Mobile have such a strong desire to build multi-pool OpenStack resource pools at large scale? First of all, China Mobile has 800 million end users and more than 1,000 important business partners. Our users trust us and expect us to build a reliable, cheap, and scalable public cloud for them to use. Here is an example: MCloud. In China, a large number of users want to back up their local files to the cloud, so we developed a cloud service called MCloud. It is similar to Dropbox: you can upload your photos, documents, and videos to MCloud, and we will keep those files safe. This picture is the MCloud portal; we hope you will visit it and use it. China Mobile has used a large number of physical machines to build this cloud and serve a lot of enterprise users and individual users.

Second, China Mobile provides a lot of internal and external services. In order to improve physical machine utilization, reduce cost, and increase service reliability, we should migrate some of those services into the cloud. For example, China Mobile's internal business analysis system and internal R&D system are now being migrated to the cloud. And third, by using China Mobile's public and private cloud management platforms, all the resources, including computing, networking, and storage, are centrally managed and controlled.
It reduces operation and maintenance costs and enhances the ability to use and manage all the resources in different data centers. Fourth, by building large-scale OpenStack clusters in different regions, end users can choose the nearest pool to create and use their resources. This keeps the services and the resources close to the end users. Next, Dr. Tan will take over.

OK, thank you for the background introduction. A little bit of background about the Senlin project first. I know from the keynote talks this week that for a lot of people this is their first summit, so maybe not many people know about this project. Senlin is the clustering service of OpenStack. We started this project two years ago, and the main targeted use cases are auto-scaling, auto-healing, high availability, load balancing, and flexible resource pool creation and management.

On this diagram, I'm showing the high-level architecture of Senlin. On the left-hand side, you can see we have the Senlin command-line interface, and we have an OpenStack client plug-in for it as well. Our friends from Huawei have contributed senlin-dashboard, which is a Horizon plug-in. You can use it today, although its features are pretty limited so far. And we have Python and Java bindings contributed by some other developers.

So we have the Senlin API talking to the Senlin engine; by default, you get these two components deployed. What the service does is call the back-end services. Today, we support Nova server and Heat stack as our main profile types. The profile is an important concept in the Senlin design: a profile basically tells Senlin how to create, update, delete, and retrieve a resource, and that is all. Senlin takes over all the resource group building and operation work.

When managing such a resource pool, we also have some built-in policies. These policies help tell Senlin how to manage the cluster in a smarter way. For example, we have scaling policies, which are pretty similar to the AWS and Heat designs. We have a health policy, which you can attach to a cluster so that it can do failure recovery for the cluster. There is a deletion policy that tells Senlin when and how to choose a victim node to delete when you want to scale in a cluster. We have an affinity policy, and cross-region and cross-availability-zone placement policies that allow you to deploy a large-scale cluster. The load-balancing policy is based on the LBaaS v2 implementation in OpenStack. That is a pretty high-level view of the Senlin project.

Here is a more detailed view of the Senlin server design. On the left-hand side, we have the API, which is pretty much what other projects do, and here we have the engine component. The API communicates with the engine through RPC. Those are the core components of the Senlin server; all the other components are plugins. One concept here is the profile: today we have implemented a Heat stack profile and a Nova server profile. Recently, in the version 2.0 release, we also added a docker-py based driver so that you can manage your own Docker cluster using Senlin. That may be a good thing, maybe a bad thing, because that world is crazy now.

We talk to OpenStack services through a single driver layer, built on the OpenStack SDK. So we don't have a dependency on, for example, the Nova client, the Neutron client, and all those versions. We don't want to check those versions. We have a single dependency, and that is the OpenStack SDK.
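To make that concrete, here is a minimal sketch of what "one driver, one dependency" looks like when creating a cluster node through openstacksdk alone. This is illustrative, not Senlin's actual driver code; the spec fields, the `mycloud` clouds.yaml entry, and the network ID are assumptions for the example.

```python
# Minimal sketch: drive the backend through openstacksdk only, with no
# per-service clients. A Senlin profile plays the role of `spec` here.
import openstack

conn = openstack.connect(cloud='mycloud')  # credentials from clouds.yaml

def create_node(spec):
    """Create one cluster node from a profile-like spec dict."""
    image = conn.compute.find_image(spec['image'])
    flavor = conn.compute.find_flavor(spec['flavor'])
    server = conn.compute.create_server(
        name=spec['name'],
        image_id=image.id,
        flavor_id=flavor.id,
        networks=[{'uuid': spec['network']}],
    )
    # Block until Nova reports the server ACTIVE (raises on error/timeout).
    return conn.compute.wait_for_server(server)

create_node({'name': 'node-001', 'image': 'cirros', 'flavor': 'm1.small',
             'network': 'REPLACE_WITH_NETWORK_ID'})
```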
If you replace that SDK with something else, you can imagine using Senlin for other clouds. But I shouldn't talk about that; this is the OpenStack Summit.

On the right-hand side, we have the receiver concept. Basically, it's a socket: you can plug in your own monitoring service, application, or software, so that when something interesting happens in the physical world, you can trigger some actions on your cluster. It can be used for auto-scaling, whatever. The Senlin project is not doing monitoring. We believe that in real-world deployments people already have their own preferred monitoring solution, so we are not doing that. But with receivers, you can hook those alerts, alarms, or notifications back to Senlin so that Senlin can react to those events. It's an extensible design.

Here are the policies. All of these are implemented today as built-in policies. If you don't like them, you can easily develop your own; it's pretty easy and extensible.

A little bit of history about the Senlin project. We started this project in late 2014, then we got accepted into the Big Tent last year. We had our first release this April, and now it's at the version 2.0 release. So it's pretty stable, but it's not bug-free for sure. That's the general introduction to the Senlin project.

So back to the China Mobile deployment. We did a lot of optimizations, and we met a lot of challenges, problems, issues, whatever, when deploying this service there and having it manage large-scale clusters. Here is a typical deployment. We have six Keystone nodes, because Keystone is terrible; I will talk about that later. To make sure the Senlin API and the Senlin engine can handle heavy, large-scale requests, we deploy three APIs and three engines, and all those services are deployed in Docker instances. In front of them, we have HAProxy and Keepalived to do the load balancing and redirection. At the back end, we have 500 Nova compute nodes. We partition these nodes into several availability zones, but we didn't use regions, because our current experience is that OpenStack is pretty good: it can handle 500 Nova compute nodes, no problem. We don't need a new region. We also deployed Ceilometer. OK, I don't know if it's still called Ceilometer today; it's Telemetry, Panko, Aodh, whatever. We use this service to do resource monitoring, and the primary use cases are billing and auto-scaling. So that's the deployment.

Starting from this slide, I will share with you some experiences from deploying the Senlin service for VM cluster management. The first thing we did was optimization at the hardware infrastructure layer. To make sure the network bandwidth is assured, we have three NIC bonds per physical node, and we separate the business, management, and storage traffic from each other. That is the preparation. For the core services, especially the InfluxDB node, we use SSDs, because disk I/O was found to be the primary bottleneck there.

The next layer of optimization is about middleware. For HAProxy, we tried the default config, where we had about three processes, but we soon found that this was not acceptable, so we increased that number to 10; we have 10 HAProxy processes running. We also did some optimization to the failure detection in HAProxy so that HAProxy can react quickly to back-end service failures. Previously, this failure detection was done only through TCP, which is not that reliable, so we improved that, as the sketch below illustrates.
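The point of that change, roughly: a TCP connect can succeed while the service behind it is wedged, so the check should exercise the application layer. Here is a toy Python illustration of the difference; the backend addresses, port, and /healthcheck path are made up, and in a real HAProxy setup this logic would live in HAProxy's own health-check options rather than in external code.

```python
# Toy illustration of application-level health checking: a TCP connect
# only proves something is accepting connections, while an HTTP probe
# proves the service can still answer a real request.
import socket
import requests

BACKENDS = [('10.0.0.11', 8778), ('10.0.0.12', 8778)]  # made-up addresses

def tcp_alive(host, port):
    """The weak check: only proves the port accepts connections."""
    try:
        socket.create_connection((host, port), timeout=2).close()
        return True
    except OSError:
        return False

def http_healthy(host, port):
    """The stronger check: the service must answer an HTTP request."""
    try:
        return requests.get(f'http://{host}:{port}/healthcheck',
                            timeout=2).ok
    except requests.RequestException:
        return False

for host, port in BACKENDS:
    print(host, tcp_alive(host, port), http_healthy(host, port))
```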
The second optimization is about Keepalived. We improved the exception handling in the face of network adapter failures. When such an event happens, we quickly restart the network, and if that still cannot solve the problem, we reboot the node. This also helps improve service availability.

For large-scale cluster creation and management, there are two important issues: one is reliability, and the other is efficiency. Service availability is also very important. For RabbitMQ, we enlarged the connection pool, and we enabled heartbeat checks to ensure that the message queue is always usable. We also did something to save RabbitMQ from the split-brain problem; that is still a problem today, and I just cannot understand it. We also increased the open-files setting to 100,000. That's a big number; practically, it means unlimited. That's the optimization for RabbitMQ.

For the MySQL database, we also enlarged the connection pool. It's very big, also 100,000, so effectively no limit. MySQL is running on a very powerful machine so that the database node won't be a bottleneck. The team also investigated and solved a problem with DB deadlocks: if you are running a MySQL cluster in active-active mode, one request may want to write a table on this node while another request writes the same table on that node, and sometimes you get a deadlock. This is something the team figured out and solved.

For the Senlin service itself: we knew from the beginning we were targeting large-scale cluster operation scenarios, but we didn't have a lot of experience, so all these improvements were done gradually. The first step is that we allow a multi-engine deployment, and for each engine you can set multiple workers. But like many other OpenStack services, Senlin is no exception: it's written in Python, so we still suffer from the global interpreter lock (GIL) problem, and we don't have real parallelism. To solve that, we are looking into multiprocessing in the future, maybe in the next release or the release after next.

The next optimization is about the action queue. When you are creating a cluster of 1,000 nodes, you are not supposed to send 1,000 requests simultaneously to Nova; that is a DoS attack. Yet another fact about action execution: in a cloud environment, most requests are processed asynchronously. When you want to create a Nova server, the request returns to you with a 202, which means "request accepted, I will process it, wait"; it will be handled. That's the common scenario. We cannot say, "OK, you want a Nova server, here it is." That is not possible. Since Senlin creates large-scale clusters, we designed the action execution to be asynchronous from the very beginning. Every request arriving at the Senlin engine is queued: it is saved into the database and executed later. That's one part of the design. For scheduling the queued actions for execution, we use a concept similar to the Linux kernel's scheduler design. I'm not going to dive into the details, but it helps improve the action scheduling.

In the version 2.0 release, we also added a batch policy so that any cluster update operation can be performed in batches. You can guarantee that a minimum number of in-service nodes is kept alive in your cluster; those nodes won't be brought down while you are updating the cluster. So that's an optimization; the sketch below shows the idea.
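To make the batch idea concrete, here is a small sketch of a rolling update that keeps a minimum number of nodes in service. The numbers and the `update_node` callable are hypothetical stand-ins for illustration, not the actual batch policy implementation.

```python
# Sketch of the batching idea behind the batch policy: walk the cluster
# in fixed-size batches so at least `min_in_service` nodes stay up.
# `update_node` is a hypothetical stand-in for the real per-node action.
def rolling_update(nodes, update_node, batch_size=2, min_in_service=3):
    if len(nodes) - batch_size < min_in_service:
        raise ValueError('batch too large to honor min_in_service')
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]
        for node in batch:      # nodes within one batch may run in parallel
            update_node(node)   # only this batch is out of service
        # the next batch starts only after this one is back in service

# Example: update a 10-node cluster two nodes at a time.
rolling_update([f'node-{i:03d}' for i in range(10)], print)
```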
Next, Keystone. Keystone plays a very important role in OpenStack. It's not just for end-user authentication; it is also used for authorization sometimes, and in Senlin it is also used for trust verification. In most API pipelines, Keystone is also used as a middleware. If you look at the Keystone log, you will see for every request how busy that service is. We soon discovered this, and in the China Mobile case we deployed Keystone as an Apache plug-in so that we can benefit from the Apache web server design, and we increased the number of workers for the service.

Another interesting thing we did is about the token format. The default configuration was the PKIZ token format. We did some experiments, some benchmarks, using a different format, Fernet. That change had two results. One result: we profiled this using Rally, setting the concurrency to 5,000. Using PKIZ, completing this benchmark takes about 160 seconds; switching to Fernet, it's about 57. That's a huge saving. The other side effect is that the failure rate was greatly reduced: previously we had more than 400 failures, but after this change we had only four. That's a huge improvement.

The next optimization is about Nova. We tuned the number of worker processes; that's not difficult to imagine. We also tuned the scheduler's host subset size so that the chance of Nova scheduling two VMs to the same host is not that high; with this setting, Nova will pick a random node from the top 10 candidates. That greatly improved the service concurrency and throughput. For the China Mobile deployment, we also tuned the flavor properties to ensure the host resources are used efficiently. Yet another thing we did is use config drive instead of the metadata service; it's kind of common knowledge today that the Nova metadata service is not that reliable. The last crazy thing we did is cache all the images on all compute nodes, so when you want to start a virtual machine, you don't need to download the image from Glance. That's a huge optimization.

The next thing is about Neutron tuning. According to our experiments, if you are using the OVS tools to distribute a flow table and change the OVSDB port configuration, it previously took 0.4 seconds per record. That's a huge amount of time for this particular operation, so the team tried to optimize it. For the OVS agent restart, we parallelized the execution, and we also did some other optimizations; for example, we avoid deleting the ports and the flow tables when we restart the OVS agent. This patch has been merged into Liberty. We also used the local controller Ryu to distribute the flow tables; that patch is also merged into Liberty.

The last thing I'd like to mention about Neutron is that LBaaS v2 today is pretty slow. For the status updates, we previously set the timeout to 10 seconds or so, but it's still not enough, and we don't know what threshold would be best in a real deployment. So we made the LB status timeout a configurable option: when you deploy Senlin and attach a load-balancing policy to a cluster, you can tune this number for your particular setting. The sketch below shows the polling idea.
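As a rough illustration of that configurable wait, here is a sketch only: the option name `lb_status_timeout`, the default values, and the `get_status` callable are assumptions for the example, not the exact Senlin code.

```python
# Sketch of a configurable LB status wait: poll until the load balancer
# reports ACTIVE, or give up once the (tunable) timeout expires.
import time

def wait_for_lb_active(get_status, lb_status_timeout=300, interval=5):
    """Return True once the LB is ACTIVE, False on timeout or ERROR."""
    deadline = time.monotonic() + lb_status_timeout
    while time.monotonic() < deadline:
        status = get_status()   # e.g. a query to the LBaaS v2 API
        if status == 'ACTIVE':
            return True
        if status == 'ERROR':
            return False        # no point waiting once provisioning failed
        time.sleep(interval)
    return False
```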
The next optimization is about Telemetry, about Ceilometer. As I have mentioned, Ceilometer is very important for the China Mobile OpenStack deployment; it is used for billing purposes and also for scaling the VM clusters. According to our profiling results, the Ceilometer-plus-MongoDB deployment is not a good combination; it is virtually not usable. So later on, we migrated it to use Gnocchi, and after this shift we can now get a sample query in less than five seconds for a 10,000-VM deployment. That's a huge improvement as well.

After all these optimizations, and some others I'm not mentioning today, we finally got some results. These are very preliminary results; don't trust us too much, because these experiments were done in a non-production environment, and while we were doing these experiments the production environment was already online. But it still gives you some idea. Back in July of this year, we still had low concurrency and some failures, but after one month of work, things had improved significantly.

Speaking of future work and next steps: as I just mentioned, we will deploy some other sites at Hohhot and Harbin; both will have 1,000 nodes as well. The next thing is container management. We have a preliminary implementation today, and we are going to go down that direction anyway, because in many use cases people don't want to shift from OpenStack to Kubernetes, to Swarm, to whatever. They only want to use Docker as a lightweight virtual machine to run their workloads, deployed in a scalable and highly available way. That's a similar use case; we can do that. And the last thing is about the management of the Senlin service itself. Today, Senlin is deployed as Docker container instances; if the host node is restarted or fails, we have to manually start the Docker instances. That's terrible. It's not a complicated task, though, and we've solved that problem as well.

With that, I'm pausing here to see if we have any questions, suggestions, donations, whatever. Thank you.