Hello, and sorry, my Chinese isn't very good, so we thank you for your kindness in letting us continue in English. Thank you very much, and thank you for coming today to "Practical Lessons from Building a Highly Available OpenStack Private Cloud." One thing right off the bat, because people are always asking for the presentation slides: all of the slides will be available on the conference website later tonight, whenever the organizers actually upload them, and each of you is very much invited to use these slides for any purpose you wish under the terms of the Creative Commons BY-SA 3.0 license, which means you can use them for anything you want as long as you credit your source. Let me start by introducing Sebastian Kachel. Sebastian is one of the folks on the OpenStack team at Pixelpark. I've had the pleasure of working with Sebastian and the rest of their very, very sharp team for several months now. Sebastian is based in Berlin, Germany, and, like about two-thirds of the people in the keynote this morning, this is his very first OpenStack Summit, so we can count him as a very new addition to our large and growing OpenStack community. And this is Florian. Florian is an HA, storage, and cloud guy, consultant and instructor. He is the hastexo co-founder and CEO, based, as he likes to joke, in seat 10C, economy class, but otherwise at home near Vienna in Austria. So what was the challenge to solve? We ensure that the services we provide are highly available, and so we must do this in our private cloud, too. And this is actually a very typical challenge in the OpenStack community. Generally speaking, OpenStack users fall into one of two different groups. One group is primarily looking to OpenStack as a massively scalable cloud platform for maybe a handful of applications that they are building.
This is the kind of group, the company or the team, that is trying to build the next Twitter, or something of that nature with similar scalability needs. What these people have in common is that they typically have the luxury of maintaining only a handful of applications, which they can re-engineer, if they so choose, at essentially any point in time, and the only constraints they have are maybe time and budget. And then there is a wholly different group that looks at OpenStack as a modern way of running a data center. There you typically don't have the luxury of supporting maybe half a dozen applications, but maybe half a thousand, or more than that. In the first group you have this luxury of saying: if I want high availability in my application, I can build it into the application, or into the handful of applications that I'm running, and I don't care whether my cloud infrastructure itself provides high availability or not. If, however, I'm running OpenStack as a way to run a modern data center, then I do have the expectation that my cloud infrastructure provides at least a certain degree of high availability, simply because I'm not in any way capable of re-engineering the thousand or more applications that I'm managing. As Sebastian is going to explain in a moment, Pixelpark, the company that he works for, very firmly falls into the second category, so they're very much looking for an OpenStack infrastructure that can provide this high availability to them, so they don't have to worry about it at the level of the individual virtual machine, the guest level. Yes, so we are in the second group, and we must bring the HA into the infrastructure. So let me say a little about Pixelpark. Pixelpark is a full-service agency for digital communication and e-business solutions, with departments such as concepts, project management, editorial, design, development and hosting.
So today we will show you how we built HA into our private cloud. Why do we use OpenStack? There are so many other cloud software stacks, but as I explained, we are a full-service agency, so we need the benefits of cloud computing: on-demand, scalable, elastic. It's very good that OpenStack has fixed, time-based release cycles. It is open source, so we don't have to pay for licenses, and we can get support. It has rapid development, and, very importantly, OpenStack is a cloud software that goes beyond infrastructure as a service. For a full-service agency with development as a department this is very important, because they can use it for platform as a service, and we can provide the services to the other departments and make them highly available. So why should it be highly available? As I said, the services that we provide on physical machines are highly available, and we must do this in our private cloud, too. Specifically, we provide service level agreements of up to four nines. So when we talk about four nines of availability, that's obviously less than an hour of cumulative downtime per year. The important part here is that if you have an SLA that mandates this kind of uptime, and you think you can handle outages by having a person on call whom you page, who then logs in and fixes things, and you still think you're going to achieve four nines over the average of a year or maybe two years, you need to think again. It doesn't quite work that way. If you are under an SLA that promises four nines of availability, or even 99.95% availability, you are going to need some automated fashion of recovering from failures. That is to say, you do actually need to build some form of automated high availability, automated failover, into your system.
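To put numbers on that "less than an hour per year" claim, here is a quick back-of-the-envelope calculation (a sketch in Python, not something shown in the talk itself):

```python
# Cumulative downtime a given availability SLA permits per year.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap days

def allowed_downtime_minutes(availability_pct):
    """Minutes of downtime per year permitted by an availability percentage."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR

# Four nines: roughly 53 minutes per year -- no human pager rotation
# can reliably stay under that, hence the need for automated failover.
print(round(allowed_downtime_minutes(99.99), 1))       # ~52.6 minutes
# Even 99.95% only allows roughly 4.4 hours per year.
print(round(allowed_downtime_minutes(99.95) / 60, 1))  # ~4.4 hours
```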
If you're thinking anything else, if you think you can guarantee a four nines availability SLA without some form of high availability mechanism, you're on the wrong track. So this is really important: you actually need to build automated failover into your infrastructure if you need to maintain an SLA like this. And this is a very, very common criterion for a hosting provider, for a private cloud; there are very few customers that will accept less. So for anyone who's building a private cloud, that is an absolutely important concern. So how did we do this HA in OpenStack? First I will start at the base of the system, and this is storage. We need highly available storage, and it must be scalable, and so we use Ceph. Ceph is a distributed storage platform designed for excellent performance, reliability and scalability. So it guarantees reliable storage, and no data loss. What do we store in the Ceph cluster? We store Cinder volumes, Glance images, static data over the RADOS Gateway (S3), and we store instance data. So Ceph is an excellent cloud storage for us. How did we build it? You can configure the number of copies of data in Ceph, and we are working with three copies. So 66% of the physical nodes can crash and all data is still available with three copies. We use one disk per OSD. Underneath the OSDs there is no RAID; they are standalone disks, because you don't need RAID. On top of each OSD is an XFS file system, and we put the journal on a separate solid state disk because it's faster. Every storage node has eight gigabit ports in trunk mode, so you have fast connectivity to the storage. Most of these are basically standard best practices for any Ceph cluster. Can we have a quick show of hands, please: who in here has built a Ceph cluster, either in testing or production, at some point?
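The OSD layout just described, three replicas, XFS on each standalone disk, journals on a separate SSD, might look roughly like this in ceph.conf (a sketch only; the host and device names are hypothetical, and the option style reflects Ceph releases of that era):

```ini
[global]
    # Keep three copies of every object, as described above.
    osd pool default size = 3

[osd]
    # One OSD per standalone disk, formatted with XFS, no RAID underneath.
    osd mkfs type = xfs
    # Write-ahead journal size in MB; the journal lives on an SSD.
    osd journal size = 10240

[osd.0]
    host = storage01              ; hypothetical hostname
    devs = /dev/sdb               ; hypothetical data disk
    osd journal = /dev/sdg1       ; hypothetical SSD partition
```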
As we can see, the technology is really getting traction, because when I first asked this about two summits ago, significantly fewer people in the room raised their hands. So this is a relatively cookie-cutter, straightforward Ceph OSD configuration. It uses the XFS file system as a balance, a trade-off, between file system features and file system stability. If you want the best-optimized file system for the Ceph OSDs, you would probably go with Btrfs. But then you might be the kind of person that says, well, Btrfs is two years away from production, always has been and always will be, but only if you're cynical. I think we can all agree, though, that Btrfs is currently in an experimental state. So XFS on the OSDs of a production system is generally a really, really good trade-off between performance on the one hand and stability and reliability on the other. As for the one-disk-per-OSD approach: one thing that's nice about Ceph clusters is that Ceph takes care of all of the data redundancy for you. That is to say, you can run your Ceph OSDs on simple JBODs, so you don't really need to worry about RAID controllers or that kind of stuff anymore. And the journal on a separate SSD is essentially a performance consideration. Ceph, when used with XFS, uses write-ahead journaling, which means that all writes first go to a journal, which determines your write latency, and only later do they go to the actual file store. So if your journal is reasonably fast, such as when you're running an enterprise SSD, that generally speeds up the performance of your Ceph cluster as a whole. Like I said, all of these are general best-practice recommendations that most people choose to follow when they deploy Ceph with the XFS file system. So this is the storage. The next layer I will explain is the OpenStack block storage, and this is Cinder.
The Cinder services are cinder-volume, cinder-api and cinder-scheduler, and we put them under Pacemaker, so Pacemaker monitors and controls these services across two nodes. The services run in active/backup mode; that means they are only running on one node at a time, and when a service crashes, Pacemaker moves it to the other node. The network connectivity to these storage gateways is four gigabit ports in trunk mode. The reason it's actually relatively simple to put these services under a Pacemaker high availability manager is that, when used in conjunction with Ceph RBD, that is, the RADOS block device that Cinder uses here for backend storage, the services themselves are essentially stateless. The services don't keep any local state about themselves anywhere, except in the relational database that they write their persistent data into, the RabbitMQ (or, generally, AMQP) message queues that are used to communicate between the services, and the actual data, which lives in the Ceph store itself. So failing over a Cinder service, a cinder-api, cinder-scheduler or cinder-volume service, is a relatively simple process of just firing up the processes, the services themselves. There's nothing we need to worry about in terms of actually carrying state over, because the services as such are inherently stateless. That makes the failover very clean and very fast, and specifically in conjunction with Ceph RBD it's a very elegant solution. That is not necessarily true for all Cinder backends. For example, if you're using the standard LVM-backed iSCSI Cinder backend, failover is significantly more involved, because then you do have local state that you need to fail over. But with Ceph RBD this is a really nice and elegant combination that fails over very cleanly. So this was the second layer, block storage. The next layer is the network, so let's have a look at that.
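An active/backup Cinder setup like the one just described might be expressed in Pacemaker's crm shell roughly as follows (a sketch only; the OCF resource agent names are assumed to come from the openstack-resource-agents project, and the monitor timings are illustrative):

```
primitive p_cinder-api ocf:openstack:cinder-api \
    op monitor interval="30s" timeout="30s"
primitive p_cinder-schedule ocf:openstack:cinder-schedule \
    op monitor interval="30s" timeout="30s"
primitive p_cinder-volume ocf:openstack:cinder-volume \
    op monitor interval="30s" timeout="30s"
# Group the services so they always run together on one node; on
# failure, Pacemaker restarts the whole group on the peer node.
group g_cinder p_cinder-api p_cinder-schedule p_cinder-volume
```

Because the services are stateless with a Ceph RBD backend, this failover amounts to nothing more than starting the processes on the surviving node.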
We put the services under Pacemaker too, because it works very well: Pacemaker monitors them, and when a service crashes, Pacemaker moves it automatically, so we must do nothing. The Quantum DHCP agent is configured in active/active mode; you just have to set a parameter on the Quantum server, which is very good. The L3 agent is in active/backup mode, because in the Grizzly release we had trouble running it multi-host in active/active mode. So it runs on one server, and when it crashes, it moves to the other. The Open vSwitch agent you can run in active/active mode. A few additional words about the then-Quantum, now-Neutron Open vSwitch plugin agent. What Pixelpark is using in this configuration is the OVS plugin in GRE tunnel mode. In other words, the consequence of all of this being in active/active mode is that we permanently have these tunnels established between all of the compute nodes and the network node. In effect it's one gigantic virtual switch, or actually one gigantic virtual switch per tenant, that we can just plug VMs into. That is infrastructure that basically lives all the time. When a compute node happens to crash and we need to bring up a specific set of guests on a different node, which is what we're going to talk about in just a second, that other compute node has access to the same virtual switch, i.e. the same set of GRE tunnels, and all of that is managed very nicely and automatically by the Quantum Open vSwitch plugin agent, or, as it would be called in the current Havana release, the Neutron Open vSwitch plugin agent. So this is the third layer, the network layer. The next layer is services and APIs. We put them under Pacemaker too, because it works very well, and the services are distributed: Horizon runs on control node 1, the Quantum server on control node 2, and so on. And importantly, we put MySQL and RabbitMQ on DRBD.
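Stepping back to the network layer for a moment: the active/active DHCP agents and the GRE tunnel mesh mentioned above correspond roughly to these configuration knobs (a sketch; option names as in Grizzly-era Quantum with the OVS plugin, and the local IP address is hypothetical and differs per node):

```ini
# quantum.conf (Quantum server) -- the parameter that enables the
# active/active DHCP setup: schedule two DHCP agents per network.
dhcp_agents_per_network = 2

# ovs_quantum_plugin.ini (network and compute nodes) -- GRE tunnel mode,
# so tunnels stay permanently established between all nodes.
[ovs]
tenant_network_type = gre
enable_tunneling = True
tunnel_id_ranges = 1:1000
local_ip = 192.0.2.11      ; this node's tunnel endpoint (hypothetical)
```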
So you have one DRBD primary mounted on, for example, control node 1, while the other node is in the secondary state, replicating the data from the primary. When a service such as MySQL crashes, Pacemaker moves the service: it moves the MySQL service IP, mounts the DRBD device, and starts the MySQL service on the other node. One thing that applies here: there's more than one way to do it. You don't necessarily need to do this with DRBD. Another option would have been to use a separate RBD volume out of the Ceph cluster. And yet another option would have been to use the replication facilities built into the applications themselves: the MySQL database engine offers Galera, which uses write-set replication in a multi-master mode between multiple nodes, and on the RabbitMQ side we could have used mirrored queues. DRBD was chosen here for reasons of simplicity and stability, but not necessarily because it is the one true option that you can deploy here. Like I said, there is more than one way to do this. Another thing that should be mentioned is that most of these services can be run in both an active/passive and an active/active configuration. For example, the Nova and Quantum API services are classic examples of services that can run in multiple instances on multiple nodes. The nice thing about the Pacemaker cluster stack is that it enables us to do that very nicely as well. In Pacemaker we can define a cluster resource as what is called a clone, and with that we can essentially say: give me four instances of nova-api; I don't care on which of the nodes in this set of, say, eight nodes you run them, but give me four instances. So we can use Pacemaker features here both for classic failover high availability and also for a certain amount of scale-out. And that can then be combined with a load balancer, such as HAProxy, or other things as well.
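The "give me four instances of nova-api" idea maps onto a Pacemaker clone resource, roughly like this (a sketch; the OCF agent name is again assumed from the openstack-resource-agents project):

```
primitive p_nova-api ocf:openstack:nova-api \
    op monitor interval="30s" timeout="30s"
# Run exactly four copies of nova-api, at most one per node, anywhere
# in the (say, eight-node) cluster; Pacemaker picks the nodes.
clone cl_nova-api p_nova-api \
    meta clone-max="4" clone-node-max="1" interleave="true"
```

A load balancer such as HAProxy would then distribute API requests across whichever nodes currently run a clone instance.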
So you see the Pacemaker cluster with two nodes that keeps Horizon, Keystone, Glance, Nova, RabbitMQ, the Quantum server and the MySQL database always on. So this is the services and API layer. The next and last layer in our infrastructure is compute. We use a Pacemaker cluster here too, and the special thing is that the instances are running in the Ceph cluster. So we mounted an RBD volume under the standard default path, /var/lib/nova/instances, and we created our own pool in the Ceph cluster for it. You shouldn't use too large a pool, like five terabytes or something like that; I don't know exactly why, but it is better to use a smaller one, because with a very large pool the write and I/O performance was not good in our environment. And so that it is fast, we use six one-gigabit ports in trunk mode. Again, the same thing applies that I said earlier: there is more than one way to do it. Anyone familiar with Nova realizes immediately that because the storage that normally holds the ephemeral data, the throwaway data for virtual machines, is itself on persistent storage, there is no such thing as ephemeral storage here: everything by default is persistent. You could also employ a strategy where you say that by default everything is not persistent, and when you want a whole virtual machine to be persistent, you boot it off a Cinder volume, which is perfectly supported in Nova. And one other thing you may be wondering about: why use an RBD mounted at /var/lib/nova/instances, which you then have to put under Pacemaker management and so on, when you could just use the Ceph file system and mount everything directly? Again, that is a trade-off in terms of stability: the Ceph file system is considered experimental at this point. The downside of mounting an RBD that must be available to multiple compute nodes is that you have to use some sort of high availability manager to do that.
In Pixelpark's case, that HA manager is already there, and the RBD mount is just an additional resource that you plug into it, so that makes it very simple. As an alternative to doing exactly this: Sebastien, could you please raise your hand real quick? Sebastien Han, who is in the fifth row here, has written a very interesting blog post on how to do this with CephFS. So that is another alternative approach. So this is our full HA OpenStack. And the last thing that I will cover is how OpenStack affected our organization. Let me just quickly add one more thing. What this actually means, just in case it is not 100% clear, is that you can actually kill a compute node and the virtual machines will come up on a backup compute node, a secondary compute node. So you have fully persistent, highly available virtual machines, which is something that, unfortunately, you can't really do with the onboard components in OpenStack. It is not built into Nova itself, but it is something that you can build using a high availability manager that interacts with OpenStack. So in this case you can actually kill a compute node, and the virtual machines on that compute node will all fail over to another one, continue to live there, and remain available for you, which is kind of what you want in a data center infrastructure. So the question was: is that four separate two-node Pacemaker clusters? Yes, it is separate pairs, but with the option of extending them; the compute cluster, for example, you can easily extend. But yeah, you get the idea. I mean, Tim works on this stuff, so he gets the idea anyway, but that was just a rhetorical figure of speech. We had another question back here. Okay, yes, the question was: if you kill a compute node and later bring it back up, someone has to take care of starting the virtual machines on that node. Nova actually does take care of that.
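For reference, Nova's side of this recovery comes down to two nova.conf settings (a sketch; the opaque hostname value is hypothetical):

```ini
# nova.conf on both compute nodes of a failover pair
[DEFAULT]
# On host boot, restart whatever instances the database says
# should be running on this (logical) host.
resume_guests_state_on_host_boot = true
# Opaque hostname shared by the pair, so instances are recorded
# against the logical host rather than the physical machine.
host = compute-pair-01
```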
We have a Nova option called resume_guests_state_on_host_boot, which means that when you bring the host back up, Nova checks in the database which machines are supposed to be running on that node, compares that against the local libvirt state, and whatever is not running gets fired up. The only little trick you need to play is that you need to override the hostname that goes into the Nova database with an opaque hostname, and for that you have the host parameter in nova.conf. So that's the whole trickery around that. So how did this affect our organization? Implementing an OpenStack environment is a challenge, and it's a very good idea to get training before you install OpenStack, and to get support once you have a live environment. Now we ensure quality and we work efficiently. We have a programmable infrastructure, and this creates a basis for further innovation, because we can test software or infrastructure in OpenStack in minutes and then destroy it again. We are ready for up-and-coming technologies, and we now sponsor an OpenStack user group. And, importantly for operators, it is a lot of fun to work with. And I had a lot of fun. My team and I had a lot of fun working with these guys as well, because they were really, really sharp and nice to work with. If you have further questions, you can of course ask them immediately afterwards, but I realize that, according to rumors, there are apparently people with beers waiting outside. By all means, if you do have questions, get in touch. Basti's email address is right there; mine is at the bottom. It's very simple: first name dot last name at company dot com. So sebastian.kachel@pixelpark.com, and mine is florian.haas@hastexo.com. You can also find our company websites at pixelpark.com and hastexo.com, and you can find us on Google Plus, on Twitter, and wherever else, and obviously in the attendee directory for this conference.
And with that, we will be happy to take your questions. But before that, let us quickly say, thank you. Thank you.