Bonjour. I think we can get started. My name is Ximing, and I'm with IBM, but this talk is not about IBM; it's about Suning.com. You can see the names Zhang Xiaobin and Jinlong here. They actually proposed this talk to the summit, but unfortunately they cannot make it today; I will mention the reasons later. I have been involved with Suning.com for quite some time now, so today I will try to present for them. If you have any questions, please contact these authors.

This is the agenda, the topics I will touch on today. First, a little introduction to Suning for those of you who have no idea what Suning is. Then Suning's OpenStack journey. Like most companies adopting OpenStack, their journey is not a long one, but they have learned a lot, and they have a wish list to share with the community.

This is an overview of Suning.com. The company was established more than 20 years ago. It is today the largest retail company in China, and it is among the top three Chinese private enterprises. You may want to read that a little carefully, because it's a Chinese cultural thing: when I am the top one, I will usually say I'm among the top three, so you get the idea; it's almost number one. Suning's business is not limited to retailing. Today it includes logistics, supply chain, and even real estate investments. The company is growing very, very fast. By the end of 2012 (some of this data needs to be refreshed), they already had stores in more than 700 cities, in China and in other countries. The total number of staff is 180,000. That's a big company. They have R&D centers in Beijing, Shanghai, Nanjing, and even Silicon Valley. The brand value and annual revenue are over there; you can read the numbers. To give you a rough idea of how large the retailing market is, I'm also showing some numbers here, for just the first half of 2014, the first half of this year.
The total market size is $175 billion, and year-over-year growth is more than 40%. This is a big cake.

Why does Suning have to do cloud business? This company has been using cloud for quite some years. To be honest, they didn't start with OpenStack; they were using some other cloud software. What they see in cloud is that it brings them many opportunities and, needless to say, many challenges as well. They want to adopt cloud technology because, as some big guy said, every company today is a software company. Suning is trying to reinvent itself, to seek opportunities in business model innovation, new collaboration models, and many other areas.

Suning cloud, and this is also a brief overview, is not OpenStack; it's built on some other cloud software. They have their own private cloud and also an open, public cloud. Their private cloud spans multiple data centers; in Nanjing city alone they have two data centers, and there are more in Beijing. They have thousands of physical machines and tens of thousands of virtual machines in their private cloud. They have installed a lot of middleware and applications, and they have implemented their own automatic deployment, orchestration, and so forth. They also offer public cloud as a business, providing virtual machines, virtual storage, and database as a service. It seems like this company is evolving into something else.

Suning's OpenStack journey is not a long history. They only started using OpenStack in the middle of last year, I think with one of the early versions. They did their install manually, step by step, hacking the configuration files and all that kind of stuff. Today, this deployment has evolved from a single deployment to a multi-data-center deployment. The usage is not limited to R&D activities; they have gradually been migrating their existing workloads to the OpenStack cloud. Here is some status about their installation. For compute, they have quite a few physical hosts.
For storage, they were using Cinder multi-backend, combining LVM and GlusterFS, with QoS guarantees. For networking, they have isolated the management, data, and storage traffic. They use OVS bonding to improve availability, and they use hardware-based load balancing. That's the network. They are also experimenting with containers. Containers are so hot today, and Suning is no exception; they want to look into this to see what they can get from container technologies. I think if today's topic were containers, this room would be very, very crowded. For deployment, they have been using Cobbler and Puppet, and they are trying other tools such as Ansible now. For the controller nodes, they have set up three nodes for HA purposes. For monitoring, they have been using proprietary monitoring tools and some open source tools as well. For optimization, they have resource scheduling for single-node, multi-tier, or distributed application configurations.

Next, I will talk a little bit about their workloads: what is Suning running on the OpenStack cloud and on their previous cloud? I'm calling this a drill-down because it's pretty complicated, pretty mixed. You can see they have more than 100 applications, and each application looks different. It's a mix of CPU-intensive and IO-intensive workloads. For example, they were developing mobile applications for their consumer endpoints, so that you can buy something using a mobile phone. Those development activities were hosted on the cloud, and that means a lot of storage requests: 800 gigabytes or even one terabyte of storage requirements. They also do some search engine compilation, and they run sentiment analysis jobs; needless to say, they want to monitor what consumers say about Suning. That's very valuable feedback for the company. An interesting point is thumbnail generation: they generated those images online, on the fly, using some open source library.
They didn't pay any attention to this, but later on they found that all that thumbnail generation is very, very CPU-intensive. It's not what they expected. For their internet applications, this company has been using different software stacks. They use open source stacks, Apache, JBoss, MySQL, and also some IBM solutions, IHS, WAS, DB2. They are using other things as well; it's a very complicated mixture.

The complexity of their workloads is not limited to this. Here I'm showing a very simple, very representative multi-tier enterprise workload. You have a web front end serving the HTTP requests. You have your business logic deployed into the middle tier, the application server. You have your data stored in the back end, which is a database of some kind. In Suning's case, because they have more than 100 applications that came from different development teams and different departments, sometimes they want the front end clustered, they want it auto-scaled, and they want this front end deployed on different hosts, in different availability zones, across data centers. For the middle tier, they have the same requirements. The middle tier is where the code is executed; it's their business logic. They upgrade their applications every one to four weeks, which is a pretty short release cycle: when you have your VMs deployed, be prepared, because you will upgrade them just a few weeks later. For the back-end database, sometimes they want an active-passive configuration, sometimes they don't, and they have some placement requirements as well, which add more complexity to this picture. The connections between tiers are very, very flexible. If you are scaling your front end, your middle tier, or your back end, these different tiers need to discover each other automatically and dynamically. Service discovery and registration is a difficult problem for them.
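To make the service discovery and registration problem concrete, here is a minimal sketch of the kind of registry the tiers would need. Everything here is a hypothetical illustration, not Suning's actual solution or an OpenStack API: the `ServiceRegistry` class, the TTL, and the endpoint addresses are all invented for the example.

```python
import time

class ServiceRegistry:
    """Toy in-memory registry: tiers register their endpoints and look
    each other up dynamically instead of hard-coding addresses."""

    def __init__(self, ttl_seconds=30):
        self.ttl = ttl_seconds
        self.entries = {}  # service name -> {endpoint: last heartbeat time}

    def register(self, service, endpoint, now=None):
        """Called by an instance when it comes up (and on every heartbeat)."""
        now = time.time() if now is None else now
        self.entries.setdefault(service, {})[endpoint] = now

    def discover(self, service, now=None):
        """Return every endpoint whose heartbeat is still fresh."""
        now = time.time() if now is None else now
        live = {ep: ts for ep, ts in self.entries.get(service, {}).items()
                if now - ts <= self.ttl}
        self.entries[service] = live
        return sorted(live)

registry = ServiceRegistry(ttl_seconds=30)
# A newly scaled-out application server announces itself...
registry.register("appserver", "10.0.0.5:8080", now=100.0)
registry.register("appserver", "10.0.0.6:8080", now=110.0)
# ...and the web front end looks the tier up instead of hard-coding it.
print(registry.discover("appserver", now=120.0))  # both still fresh
print(registry.discover("appserver", now=135.0))  # 10.0.0.5 has expired
```

The point of the TTL is exactly the scaling scenario from the talk: when a tier shrinks, the stale endpoints age out automatically instead of having to be un-hard-coded somewhere.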
Sometimes they have to hard-code it, but today we hope some tools from OpenStack can help with this; they started with Heat. That's about the complexity of their workloads.

As for deployment, I have mentioned some complexities, but here are some more. Their applications came from different departments, different teams, and each team has its own unique view of how to use Apache or JBoss. Some teams use Apache as a load balancer or just a reverse proxy, not a web server. Some teams throw Apache away entirely; they think JBoss provides an HTTP server already. Maybe it would be desirable to unify this, but there is a saying: if it ain't broke, don't fix it. If their current configuration is already tested and has been running for a long time, we had better not change it at the moment. That's some of the complexity. Service discovery and registration, I have mentioned that already. The next thing is workload distribution. In traditional enterprise applications, you usually configure a thread pool, a worker pool, or something like that for scaling. But on a virtualization platform, on cloud, that is no longer the case. Instead of forking new processes, you are creating new virtual machines. And it's not just about creating new virtual machines: you are deploying your middleware and your application, and configuring everything from the very beginning. So it's not as lightweight as just forking a new process. This is a challenge they realized when they migrated their enterprise workloads onto clouds, and it's not just about OpenStack; it's about any cloud.

So why did they start trying Heat? They saw some valuable points in Heat. First of all, Heat in their eyes is an orchestration tool that sits above Nova, Neutron, Cinder, and so on. It's a user-facing tool. And they see template-based VM provisioning as a great convenience: it reduces the complexity and the workload for the IT operators and service staff.
They also see the auto-scaling support in Heat as very, very important for them. Just to give you an idea of what auto-scaling means for Suning in particular: guess how long it takes to sell 5 million cans of milk powder, or 100 containers of milk. In Suning's case, it's only three days. I'm showing a date there, November 11th. You can see the number: four ones, four singles, how lonely they are. So when people feel lonely, they buy milk powder and drink milk; maybe that's the reason. But I'm only giving you one example. During promotion seasons, or even just holidays or weekends, their workload increases dramatically, seven times, thirty times. So auto-scaling is really, really important for them. For us developers it's kind of fun, but for them it's real money. That's what we learned from Suning. They also hope Heat can provide a standardized approach, even a process, for application deployment and orchestration. That's why they adopted Heat.

About deployment: currently we see Heat mainly as a deployment tool. It is positioned as an orchestration tool, but there are a lot of things we can improve. For example, post-launch configuration is primarily the domain of configuration tools such as Puppet. Heat has done some work to bridge these two domains, VM provisioning and instance configuration management. Suning has used Cobbler and Puppet in this domain, and fortunately they can bridge them with Heat using the software config resources. But what is very difficult for them is service discovery and mutual registration. In Heat, sometimes you have to express this as dependencies among resources, but Heat doesn't allow cycles in your template. Sometimes they have to do something on the front tier, then on the middle tier, and then go back to the front tier again.
Those cycles are not so easy to remove, so that's something they are thinking about.

Okay, next I'm going to share some lessons that the Suning team learned while using Heat. Today, Heat is more about deployment, as I just mentioned; it's not a full-fledged orchestrator yet. Hopefully, with the convergence work and things like that merged in, Heat will become more capable and more valuable for our customers. Heat-based deployment only covers part of the story. The Suning guys were asking us a simple question: we hoped Heat could do everything, but you tell us we still cannot abandon Puppet, and we still have to use Ansible, so why Heat? That's a question we need to think about. It is about the positioning of each tool, each project; but for the end users, what do they want? They want an end-to-end tool chain, from the basic image to the application fully configured, customized, up and running. That's a workflow, and we have to have tools that make that process very easy for users.

The other thing the team found out is about auto-scaling. I just mentioned auto-scaling is very, very important for Suning, but they learned a lot from it. For example, there are defects, there are flaws. Rolling update is very important for them, but if you don't specify certain properties, it just doesn't work. Auto-scaling, as I understand it, is not thoroughly tested in the community, because it's a cross-project integration: to get auto-scaling running, you need Heat, Ceilometer, Nova, and maybe something else, for example Keystone. Testing this automatically on the community side is a big burden. We spent about two months on this. We found some other problems as well. For example, when creating an auto-scaling group and scaling it from one instance up to three instances at maximum, sometimes what we saw was the number of instances jumping from one to three directly. Why? Because my second instance was launched very, very slowly.
We were still installing things on that instance. During this period, we got a second alarm from Ceilometer. There is no guard there, so the second alarm triggered another scale-up. Yes, in a Heat auto-scaling group there is a cooldown, but that period was not counted, because the second instance was not up yet. Those are some very common cases we found, and we need to fix them. I have filed a bug and assigned it to myself; I will fix it.

Another problem they found is that the Ceilometer alarm evaluator sometimes works and sometimes just doesn't. We are still investigating this. For example, at the very beginning an alarm is in the "insufficient data" state, which means the alarm has not accumulated enough data to judge whether it should be triggered or not. Then, when the alarm has accumulated sufficient data, it changes to the "ok" or "alarm" state. "Ok" means something you can ignore, but "alarm" is something you need to pay attention to. From that moment on, you have accumulated enough data to evaluate the alarm state; but in our experiments, later on that alarm might change back to "insufficient data" again, which makes no sense. Something is not running correctly on the Ceilometer side, so we are still looking into that. Fixing all these bugs is not an easy job; we are still working on it.

Okay, continuing with lessons learned. Previously, when the Suning team themselves were experimenting with auto-scaling with Heat, they believed their scaling should be triggered by network metrics: packets transferred, bytes transferred. But later on they found that this was not so interesting at all. What is more important is still CPU utilization; the workload eventually translates itself into CPU utilization or memory pressure. There are some other requirements about how to trigger auto-scaling as well.
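The double scale-up described above can be sketched in a few lines. This is a hypothetical toy model, not Heat's actual implementation: it just shows that if the cooldown clock only starts once a new instance is fully up, a second alarm arriving during a slow launch slips through, while starting the clock at the moment scaling is triggered absorbs it.

```python
class ScalingGroup:
    """Toy auto-scaling group illustrating the cooldown gap from the talk."""

    def __init__(self, max_size, cooldown, guard_from_trigger):
        self.size = 1
        self.max_size = max_size
        self.cooldown = cooldown
        # True  -> cooldown starts when scaling is *triggered* (the fix);
        # False -> cooldown starts only when the instance is fully up (the gap).
        self.guard_from_trigger = guard_from_trigger
        self.cooldown_until = 0.0  # no cooldown in effect yet
        self.pending_ready = []    # launches still doing post-launch config

    def on_alarm(self, now, launch_duration):
        # In the gap mode, only launches that have *finished* by `now`
        # have stamped the cooldown clock.
        for ready in [t for t in self.pending_ready if t <= now]:
            self.cooldown_until = max(self.cooldown_until, ready + self.cooldown)
            self.pending_ready.remove(ready)
        if now < self.cooldown_until or self.size >= self.max_size:
            return False  # alarm absorbed
        self.size += 1
        if self.guard_from_trigger:
            self.cooldown_until = max(self.cooldown_until, now + self.cooldown)
        else:
            self.pending_ready.append(now + launch_duration)
        return True

# A 20-minute launch; the second alarm fires at t=60 while the first
# new instance is still installing software.
gap = ScalingGroup(max_size=3, cooldown=300, guard_from_trigger=False)
gap.on_alarm(now=0, launch_duration=1200)
gap.on_alarm(now=60, launch_duration=1200)
print(gap.size)  # 3 -- jumped straight past 2, as described in the talk

fixed = ScalingGroup(max_size=3, cooldown=300, guard_from_trigger=True)
fixed.on_alarm(now=0, launch_duration=1200)
fixed.on_alarm(now=60, launch_duration=1200)
print(fixed.size)  # 2 -- the second alarm falls inside the cooldown
```

The design point is simply where the cooldown timestamp is taken: at trigger time, the window between "scale-up requested" and "instance ready" is covered; at ready time, it is not.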
To be honest, the different units and departments at Suning were arguing with each other, trying to figure out the best equation or formula to compose a trigger: some combination of CPU utilization, memory, and other metrics. I don't think there will be one uniform equation; it will still be an application-specific thing, but there are real requirements there for different workloads. Remember, they have more than 100 applications.

Rolling update is of critical importance for their workloads. Even when they have to upgrade or patch something, the whole process has to be non-intrusive to their business.

Deletion policy. Deletion policy means that when you need to scale down and delete some member from a group, the default policy is to delete the oldest one. But in Suning's case, they would prefer the other way around: delete the youngest one. Why? They think the oldest instances may already have cached something and may already have proven themselves stable. Why delete the oldest one? If I'm scaling down, delete the youngest. That's their suggestion, and we need some support here.

For scaling, they really need the trigger detection and the scaling operation to be done very, very fast, hopefully in seconds. In their experiments, with some complicated middleware, scaling up to create a new instance took about 20 minutes or so. That's not acceptable for them. For experimental purposes it's not a big deal, but for them it's real money; you have to scale up very, very quickly.

Okay. Next, I'm going to share some of the wish list I got from our Suning friends. First is availability. One problem they have frequently encountered is hard disk errors. I don't know if there are some good suggestions from the community, from you, on how to handle this. Hard disks today are not so reliable, and each time they encounter this, it means an interruption somewhere.
That's about the storage. Next, VM high availability. I know there are people who are against this, but Suning really treats their workloads as pets: if something goes wrong, they want it quickly detected and quickly recovered. From our engagement with them, we learned that this doesn't sound like a single-project job from a technical perspective. The failure scenarios may include host failures, storage, network; everything may become unreliable. Your guest operating system may crash, your application may be buggy. Anything can fail, and they are not well prepared for this, so we need to help them detect those failures and recover from them automatically.

For detection, take Nova as an example. Nova has service groups today; it maintains the internal HA status of its hosts. When some host is down, it won't schedule new VMs onto that host. But this kind of information is not exposed: when a compute host is down, no one else knows. In Suning's case, they really want to know this. They want to get a notification: there's a host failure, my VMs were running on it, can I migrate or evacuate them? So that's detection. Some of these notifications are already there for the VMs; there are VM life cycle events collected by Ceilometer. But we need a channel for Ceilometer to notify Heat: okay, some VMs are down, do you care? If you don't care, ignore it; if you do care, it's time to do something. We believe this is not a single-project goal. At least for today, there can be solutions built across projects. But eventually, we hope Heat can do this, considering the huge effort in the community around convergence: if you declare that a VM is not supposed to fail, then when Heat detects a failure, it will recover the VM automatically. That's not possible today, but eventually we need this kind of feature.

Auto-scaling again. I've talked about triggers.
Sometimes the auto-scaling group scales very slowly, considering all the post-launch configuration work. So they were thinking: okay, tomorrow there will be a campaign, a promotion; can we launch this huge group automatically at a given point in time? This is more like a time-based, cron-like auto-scaling, which would also scale down in the middle of the night, since no one is buying in the middle of the night.

They also want smarter VM placement. For example, having three Apache servers running on one physical host is not a big deal, but having all my Apache servers running on one host is too risky; they won't accept it. So who is going to handle this? Is it a scheduler hint or something? We are still trying to figure out whether this is solvable. There are also other requirements about scaling across availability zones, which is purely from an HA perspective, and about scaling across regions, which is about the workload: sometimes they cannot handle all the workload from a single data center, and they have to serve it from multiple data centers. That's a real requirement.

From their experience using Heat, they really wish there could be some kind of application profile support from the community. It's not about individual templates or nested templates, but a configurable collection of templates. My knowledge in this area is very limited; maybe some of you can give us suggestions. Are projects like Solum or Murano mature enough to support this use case? I don't know. I'm seeking suggestions here to help Suning.

Provider templates. Earlier this week, the Heat core team presented some advanced use cases of Heat, and they mentioned provider templates. It's a good thing: it promotes template reusability and helps you do version control. But sometimes provider templates are not so convenient.
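The time-based scaling they ask for can be sketched very simply. This is a hypothetical illustration, not an existing Heat feature: a schedule maps times of day to desired capacities, and a periodic job would resize the group to match. The hours and sizes below are invented for the example.

```python
from datetime import datetime

# Hypothetical schedule: (hour the entry starts applying, desired capacity).
# Scale up ahead of the promotion, scale down in the middle of the night.
SCHEDULE = [
    (0, 2),    # 00:00-06:59  night: almost no one is buying
    (7, 10),   # 07:00-19:59  normal daytime traffic
    (20, 30),  # 20:00-23:59  promotion peak, e.g. the eve of November 11th
]

def desired_capacity(now, schedule=SCHEDULE):
    """Return the capacity of the latest entry whose start hour has been
    reached today (the schedule must begin at hour 0)."""
    capacity = schedule[0][1]
    for start_hour, size in schedule:
        if now.hour >= start_hour:
            capacity = size
    return capacity

print(desired_capacity(datetime(2014, 11, 10, 3, 0)))   # 2
print(desired_capacity(datetime(2014, 11, 10, 12, 0)))  # 10
print(desired_capacity(datetime(2014, 11, 10, 21, 0)))  # 30
```

The key advantage over alarm-driven scaling here is lead time: the group is already large *before* the campaign starts, instead of spending 20 minutes per instance reacting to load that has already arrived.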
For example, it's difficult or impossible to reference resources in one nested template from another nested template, but they really need this kind of support. It's not an easy task to get the dependencies right. As I mentioned earlier, circular dependencies are not permitted in Heat, but sometimes real use cases need them.

This is my last slide. They have found some high-frequency calls to the Heat engine. They suspect these come from the os-collect-config kind of scripts; they're not quite sure yet. If that is the reason, maybe we can make the polling adaptive. The script running in your instance polls Heat: okay, are there any new software deployments, any changes to the software deployments? That frequency can be configured, but once configured, it is a constant. What they were proposing is: maybe we can set this interval very short for the first hour or so, while I'm launching things and getting the stack up and running, but later on, come back only every hour; there could be some new software released, and maybe once a day is enough. That would be desirable.

Finally, tools and guidance to make Heat-based workload deployment a standard process. It's not only about version control. They need some guidance; maybe some documents would be okay, but preferably some tools to make the whole deployment and management much easier for them.

That's all I'm going to share with you today. I'm presenting this for Suning, with IBM. Actually, at noon today, I'm going to show a VM HA demo and a cross-region prototype at the IBM booth. If you have time, stop by. Thank you.
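The adaptive polling interval proposed above is essentially a two-stage schedule. A minimal sketch of the idea; this is the talk's proposal, not an existing feature of the in-instance agents, and the concrete numbers are invented:

```python
def poll_interval(seconds_since_launch, fast=30, slow=3600, fast_window=3600):
    """Hypothetical adaptive polling schedule: poll Heat for new software
    deployments every `fast` seconds during the first `fast_window`
    seconds after launch (the stack is still converging), then back off
    to `slow` once things are up and running."""
    if seconds_since_launch < fast_window:
        return fast  # still bootstrapping: check for deployments frequently
    return slow      # steady state: new releases arrive weekly, not hourly

print(poll_interval(120))    # 30   -- just launched, poll aggressively
print(poll_interval(7200))   # 3600 -- an hour has passed, back off
```

Compared with one constant interval, this keeps stack bring-up responsive without hammering the Heat engine for the remaining weeks between application releases.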