Hello, everybody. Thank you for coming to the 5 o'clock session. I'm Scott Carlson, a senior engineer at PayPal, and today we're going to talk about the various pieces of highly available OpenStack at PayPal, both on the infrastructure side and on the OpenStack controller side. I have a couple of co-presenters here, Raj and Jitang, who will cover the software side; I'll focus on the infrastructure side while I'm up here. All of the slides, as well as our contact information, will be on the last slide, and we'll upload the deck afterwards, so you don't have to spend the whole time taking pictures.

A lot of people know who PayPal is, but in case you don't: PayPal is the world's largest digital wallet. We do business in lots of countries, in 26 currencies across 193 markets, and we process about $300,000 worth of payments, in US dollars, every minute. We have to scale pretty big, with 137 million users. Because we're worldwide, we have a lot of reasons to keep the site up at all hours of the day, so high availability is very important to us. We are a subsidiary of eBay, ebay.com, the largest auction site, and we work with them to make sure our sites are up globally, all the time.

So what are we going to talk about today? Primarily, why HA is important to us. We'll talk a little bit about why we care about high availability at the infrastructure layer and why it matters philosophically. Then we'll dive into the specifics of what we actually do in production today, since we run OpenStack in production, and finally into what we have not yet figured out, what the community hasn't figured out, or what we're still working on.

Let's start with why HA is important to us. When we look at how PayPal perceives the cloud platform we're bringing in, there are five areas we focus on specifically for our cloud. The first one is no perceived downtime. Nobody can guarantee 100% uptime, but what you can do is guarantee that the user won't perceive the downtime. With high availability, flipping things in and out of load balancers and keeping things available to the site, the cloud can look like it's always there, whether it is or not. We're a big enterprise, so everything at PayPal is enterprise class, and our cloud has to be enterprise class as well. When I go talk to my storage people, they run enterprise SANs; is the cloud enterprise? Yes, the cloud we're running is enterprise class too. And when you look at the PayPal business model and how we have to scale up to meet holiday demand: everybody is probably aware that in the US there are a couple of really important shopping days every year. There's Black Friday, there's Cyber Monday, and a really important day for us is the last shipping day before Christmas, when our volumes peak at their highest. So one thing we're focusing on is the ability to scale up our front end to meet those demands rather than building for that capacity all the time. When we scale up, if our cloud isn't ready to scale, that's really bad, and that's another reason we're trying to make sure our infrastructure is rock solid.
We've done a lot of presentations at this and other conferences, and one of the reasons we got into the cloud space with OpenStack is that we wanted to be on open source and to control our environment, which means we're rolling everything into an API. We're putting a wrapper around it; we've open sourced our Asgard work, changed the APIs the way we want them, and contributed that back to the community. Those API integrations, those tools we build, have to always succeed. So again, if our cloud isn't there, those APIs and the tools we're building around the edge, PayPal in a box, for instance, can't always succeed. And one reason I think PayPal is here today is that we made an executive decision, maybe 18 months ago, that everyone is going to use the cloud at PayPal. We're trying to make it not a choice; it's a directive. You will put your new application in the cloud; you will move your workload into the cloud. We can't say that and then have a cloud that isn't there all the time.

So let's take a look at the infrastructure; I'm going to focus on this for the next couple of slides. We can't have a single point of failure under the cloud. The cloud lives on top of a bunch of infrastructure, and we're not allowed to have single points of failure there. We have lots of data centers, so we have to scale across data centers. We also have lots and lots of racks of servers, so we have to scale across racks. We participate in flexible containers from the big vendors, where we put containers in the parking lot, so we also have to be able to scale into those burst sets of capacity. Within our data center we have availability zones defined for our restricted data, PCI data, Web tier, and mid tier, so we have to build our cloud to respect that data as well.

One of the things we have to consider at PayPal is that we are required to follow PCI, and today there is really no PCI-certified shared OpenStack cloud. So we have to ask: do I take one cloud and make it PCI, and take another cloud and make it not PCI? Do I have to separate that at the infrastructure level? And if I do, do I make it highly available at the front, or at the back? What choices do I need to make there? Also really important, because we have DR requirements and multiple data centers hosting our product: we can't have a failure in one cloud, say the one that runs in the U.S., impact the cloud that runs in Europe. We can't have replication break and have that breakage impact the other cloud, because we can't take down all of PayPal. If we take down one tenth of PayPal, fine, it will just fail out and go away, but we can't let that failure cascade.

Throughout a lot of the presentations I've attended at this and previous summits, people talk about building their OpenStack infrastructure in a rack. A lot of people talk about Pacemaker and the like: you have two of something, you sync them, you put them in a rack, and you go. We can't do that, due to our size and due to some of the decisions we've made at the network level and the data center level. We've decided to follow a top-of-rack design for our switching and network layout, where every single rack in our data center is its own /24, /23, or /22. So there's a different IP subnet per rack, and there are different redundant components per rack.
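As a rough, hypothetical sketch of how rack-level fault domains like these can be surfaced to the scheduler, here is what Havana-era Nova host aggregates and availability zones look like; the aggregate, zone, and host names below are made up for illustration and are not necessarily how PayPal models it.

    # Expose a rack as its own availability zone via a host aggregate
    # (illustrative names only).
    nova aggregate-create rack-101 az-rack-101
    nova aggregate-add-host rack-101 compute-101-01
    nova aggregate-add-host rack-101 compute-101-02

    # An application can then spread its instances across racks explicitly:
    nova boot --flavor m1.medium --image web-base --availability-zone az-rack-101 web-01

With ten racks modeled this way, losing any one rack only takes out the slice of a pool that was scheduled into that zone.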
So coming back to one of the things I said, that I have to scale across racks and I have to be able to float a virtual IP or a service across them: because I'm layer 3, I have to take that above my cloud and above my compute nodes. So we follow the concept of an infrastructure rack. When I look at compute nodes versus infrastructure, the way we qualify it is: if I run a guest VM to serve my customers, I run that on a compute node. Those are over here, and we just scale them up horizontally. My infrastructure that supports my cloud, and other data center services, lives in separate racks. People like the pets-and-cattle idea that has been floating around: we can treat all of the compute as disposable cattle and just scale it up when I need more, but the infrastructure I have to treat a little more specially. I have multiple racks, multiple power feeds, multiple load balancers, all layer 3. We do not do any layer-2 trunking between racks; every rack is a separate layer-3 domain. So we always have to bring that load balancer up above the environment.

So how do we build this infrastructure, these components we put in here? First of all, we decided that all of our OpenStack services are VMs: all of our Nova, all of our MQ, any service component that is part of the cloud. All of that lives on KVM that we roll separately from the cloud, by hand, with our deployment tools: Cobbler, Kickstart, and so on. Every one of these components relies on two or more nodes. We have at least two racks, so there's at least one of each component per rack. Then we start talking about the layers. Our physical racks are kept in different sections of the data center so that a row failure can't take them all out, and we have redundant power and switches. If we look back at one of these, oops, this one: when I look at my management rack, every single hypervisor that runs my components has multiple 10-gig interfaces run active-passive to multiple 10-gig switches. We have a management network that's shared with out-of-band communication, and that's also redundant. So every single component at the physical layer and the logical layer within this infrastructure rack is redundant all the way up, including our access layer throughout the network. Because we don't do layer 2, we have to rely on layer-3 connectivity, so we rely on our network team to follow the rules of active-passive on the Cisco or Arista switches, and then at the routing layer above that.

And one thing we decided, which not a lot of people I've heard talk about: we're a big company, and we already have lots of physical load balancers that make PayPal work. So we have decided to use those physical load balancers to load balance all of our cloud infrastructure. We don't need to try to cobble together HAProxy, for instance, across racks when I already have a big load balancer there. I can just create a VIP on that load balancer, use it to load balance my infrastructure, and rely on the network team to keep it working.

Question? Right, the infrastructure VMs do not reside on shared storage. If we take a look at this, I have a note at the bottom of the picture here: our compute is all hyperscale, highly dense compute, 16 cores with local disk, so we keep everything local to the machine. We have decided that if that fails, we will deal with it elsewhere; we don't keep the storage shared. We do share a couple of things, like the Puppet certs on NFS, but we don't put the VMs on shared storage today.
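As a minimal sketch of what "a VIP per service on the physical load balancer" looks like from OpenStack's side, here is a Havana-era keystone CLI example registering the Nova endpoint against a load-balancer VIP rather than an individual controller; the hostname and the service-ID placeholder are illustrative.

    # Register the compute endpoint against the VIP so every client and
    # service goes through the load balancer (illustrative hostname).
    keystone service-create --name nova --type compute --description "Nova Compute"
    keystone endpoint-create \
        --region RegionOne \
        --service-id <nova-service-id> \
        --publicurl   "http://nova-vip.example.com:8774/v2/%(tenant_id)s" \
        --internalurl "http://nova-vip.example.com:8774/v2/%(tenant_id)s" \
        --adminurl    "http://nova-vip.example.com:8774/v2/%(tenant_id)s"

The same pattern repeats for Glance, Keystone itself, Cinder, and the rest: one VIP per endpoint.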
Moving over specifically to the compute side: we sort of have a rule at PayPal that when we build VMs, we build lots of them. In certain pools of compute we have 3,000 of the same thing, but I can't put all 3,000 in one rack. So I have to find a way to scale my infrastructure into an appropriate set of failure domains. If I'm willing to have 10% of my site fail at any given time, I need 10 racks, I put 10% on each rack, et cetera. The same design applies in this situation. In this picture I've called out three things that are important to how we've built our data center design. At the access layer, again, there are multiple switches and multiple routers to get traffic routed where appropriate. That goes to a separate set of physical load balancer pairs. In this case the four racks represent four fault zones, and two fault zones are controlled by each pair of physical load balancers. So in this model there is a possibility that half of it could go away if one pair of load balancers failed completely, but only until the load balancer failed over. Within it, again, it basically follows a rule of two, but from the data center design perspective it's an n-plus-one model.

We definitely try to keep our compute SKU similar across the data center. Like I said, we have 96 hyperscale nodes, which works out to about 1,536 physical cores in this rack, 96 times 16. If we need more, we again do the whole cattle thing: we just scale them horizontally, plug them into whichever load balancer is appropriate, and use our OpenStack scheduling and other tweaks to move the workload across these racks as needed.

I wanted to talk a little bit about the specific compute nodes that are in these racks, so I've mocked one up. At the top of the rack are multiple 10-gig switches, active-passive. We don't use LACP today; we're just getting into that to do active-active switching, so we can trunk them into 20 gig instead of 10 gig. At the server itself we have multiple cabled 10-gig interfaces, bonded at the Linux level as bond0, active-passive. The one-gig port is a management NIC shared with our out-of-band network: if the server is off, it's used for out-of-band access, and when the server turns on it also gets an IP address on our management network, which is used mostly to communicate with our cloud tooling. Go ahead. It's purely a sequencing thing. We looked into active-active, but we didn't have the LACP problem fixed at the switches; there was a code problem on the switch. We didn't want to go active-active below that until we had both ports active above it. That's just how my network team sequenced it, so we moved on and worked on something else.

Oh, shoot. That was going to look good when it was big. What this is: a couple of months ago VMware released a port flow chart that looked a lot like this, showing every single port and every single service that VMware and vSphere use. So I mocked up the same thing for OpenStack. This is every OpenStack component, all the ports it uses, and how they communicate with each other and with our load balancers. And unfortunately you can't see it, so Raj can't refer to it.
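For reference, here is a minimal sketch of the bond0 active-passive setup described above, in RHEL-style ifcfg files; the interface names and addressing are illustrative.

    # /etc/sysconfig/network-scripts/ifcfg-bond0 (illustrative addressing)
    DEVICE=bond0
    BONDING_OPTS="mode=active-backup miimon=100"   # active-passive, link checked every 100 ms
    BOOTPROTO=static
    IPADDR=10.0.101.21
    NETMASK=255.255.255.0
    ONBOOT=yes

    # /etc/sysconfig/network-scripts/ifcfg-eth0 (and the same for ifcfg-eth1)
    DEVICE=eth0
    MASTER=bond0
    SLAVE=yes
    BOOTPROTO=none
    ONBOOT=yes

Moving to LACP later means changing the bond mode (mode=802.3ad) along with the switch configuration, which is the active-active trunking mentioned above.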
Raj Getta is going to talk more about what these components are and how we handle high availability for each of them. Can we switch to Raj? Hi, everyone. Thanks, Scott, for the infrastructure presentation. Let me switch back to the previous slide. This is the typical layout of our OpenStack services on the existing infrastructure that Scott just presented. As he said, we are enterprise class, so we already have existing F5 load balancers that we use for PayPal, and we make use of them both to make our OpenStack HA and to scale out the applications when there is a massive load. In this layout all the components, Quantum, Nova and every piece of Nova like the Nova API, Rabbit, MySQL, everything, are behind the load balancers. The rule of thumb we have is that everything should be available, so everything goes behind the load balancer and there is a VIP for every OpenStack endpoint. We had a lot of other considerations when doing this; we had to tweak some of the APIs a little to make sure they are highly available all the time in an active-active mode. Everything we run in OpenStack today is active-active; we are not running anything active-standby, and we got there by tweaking some things in OpenStack.

We did hit some problems doing this, especially with Rabbit. We are using Rabbit's HA, active-active mode to mirror all the message queues, and when we did that we found that Rabbit doesn't clean up stale connections. We mitigate that by closing the connections with a script scheduled to run every five minutes. We can't really use Pacemaker and those tools, because our infrastructure isn't laid out to support that right now, and we have a rule that the OpenStack components should be in separate racks, which already gives us high availability at the hardware level. So we're not using it, and there's no requirement for us to use it, because we already solve that problem with a physical load balancer. Some of the services sometimes stop responding or misbehave, and we mitigate that by monitoring them with Zabbix and having it restart them automatically to bump them back.

Talking a little bit about MySQL, we're using MySQL active-active, multi-master replication. Even though we use that, today we use a persistent connection to one active MySQL server, because of some race conditions we ran across, especially with Neutron/Quantum hitting multiple active MySQL servers. For the Swift cluster, we are using native Swift clustering; we are not doing anything different from the community.

Continuing with Heat: Heat is a little tricky. It's the one thing we compromised on, because we only started using it a couple of months ago and we have not yet worked out how to make it work behind the load balancers. So we are using Corosync and Pacemaker to get high availability for it for now, but we are focusing on making it active-active as well, and we've seen that Havana supports some of the pieces needed for HA in Heat. A few things remain to be done on HA for Heat.
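To make the RabbitMQ piece above concrete, here is a minimal, hypothetical sketch of the active-active mirrored-queue policy plus a scheduled stale-connection sweep; the policy name, the selection criterion, and the script path are illustrative, not the exact script PayPal runs.

    # Mirror all queues across the Rabbit cluster (RabbitMQ 3.x policy syntax).
    rabbitmqctl set_policy ha-all "^" '{"ha-mode":"all"}'

    # Cron entry, every five minutes:
    # */5 * * * * /usr/local/bin/close-stale-rabbit-connections.sh
    # Inside the script, close connections that look stale (here, crudely,
    # anything in the "blocked" state); the real criteria would be site-specific.
    rabbitmqctl list_connections pid state | awk '$2 == "blocked" {print $1}' | \
        while read pid; do
            rabbitmqctl close_connection "$pid" "closing stale connection"
        done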
We're going to keep focusing on improving Heat HA. As for the other components, Keystone, Nova, Glance, the Swift proxy, the Quantum API services, and the Rabbit cluster are all purely behind the load balancers, with some code modifications. For Cinder volumes, Jitang is going to come up on stage and present what we have done with Cinder and what contributions we are making to it.

So I'm going to talk about the specific problems we have encountered with the Cinder volume service. As a Cinder developer, I have seen bug reports and user inquiries on the mailing list about how the Cinder volume service doesn't work well behind a VIP, in an HA setup. I'm going to use two slides to talk about this problem. This is a simple workflow describing how a create-volume request is served by multiple components of Cinder. First, the user sends a request to the Cinder API. The API then talks over RabbitMQ, over AMQP, and sends the request to the scheduler. The scheduler then decides which Cinder volume service, which storage backend, is appropriate to serve this request. So it takes one, two, three, four, five, six steps to finish a volume-create request. If we take all the components into an HA mode, you can see that only the Cinder API services are behind the load balancer.

So this is how we implement HA for all the Cinder components. The pieces to look at are the Cinder APIs, the scheduler, and the Cinder volume service, because the HA problem for RabbitMQ and the database has already been solved, as Raj described. The API is a stateless service; it works perfectly with a load balancer under a VIP. The scheduler is a different problem, because the Cinder API talks to the scheduler via RPC. So we have multiple schedulers listening on the same channel, and each message sent out by the Cinder API is distributed to these schedulers in a round-robin manner. So we don't actually put a VIP in front of the scheduler.

Then, when it comes to the Cinder volume service, there is another problem: when one volume service serves a create request, it records the host of that specific volume in the Cinder database. The next time a request for that specific volume comes in, the Cinder API looks at the database to find out which volume service it has to talk to. So if volume A is served by Cinder volume service A, and that volume service goes down, the Cinder API has no volume service to talk to, because according to the database record there is no live volume service behind that volume. How we solve the problem is that we modify the configuration for Cinder volume so that multiple volume services use the same host name, and we rely on AMQP, the message queue, the same way we do with the Cinder scheduler. So when a message comes from the scheduler addressed to a specific volume, these volume services all look identical from the scheduler's perspective. A volume service fetches the message from the message queue in a round-robin manner, and since they are all connected to the same storage backend, they happily serve that request; if one goes down, the others can take over. Okay, question? Yeah, so it's active-active for the volume service and active-active for the scheduler.
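Here is a minimal sketch of the shared-host-name configuration described above, assuming a Havana-era cinder.conf; the host value is illustrative, and the backend driver is whatever shared storage is actually in use.

    # /etc/cinder/cinder.conf on every cinder-volume node in the group
    [DEFAULT]
    # Same logical host name on all of them, so they are indistinguishable
    # in the volumes table and all consume from the same per-host AMQP queue.
    host = cinder-volume-cluster-1
    # volume_driver = <driver for the shared backend in use>

With that in place, any surviving cinder-volume process can pick up requests for volumes that were created by a peer that has since gone down, which is the failover behavior described above.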
Yep. If the question is whether there can be race conditions in an active-active setup for Cinder volume: yes, there might be. If the requests target the same specific volume and the order between the requests matters, there can be race conditions, either on the backend or in the DB. Right now we don't have a good resolution for that, but as a Cinder team member I am aware of this issue. I have been working with HP and other folks; we all realize this is a problem, and we'll try to address it by adopting TaskFlow. We'll try to make every single state change implemented as a TaskFlow task so that we can roll back, and make database updates atomic, with a lock or some other mechanism. That's planned to happen in Icehouse, yes.

So these are the problems left unsolved by the previous slides; I'm going to go through them quickly. The first one is that our Cinder volume service, and likewise Heat and the Cinder scheduler service, are not very VIP-friendly, but we can work around that with the active-active approach, using the nature of RabbitMQ. The second point is that people have been talking about doing rolling upgrades of the OpenStack components, but different projects and components have different implications, and different problems arise in different projects. Take Cinder, for example: if you were at the Design Summit session for Cinder today, you have some idea of why it has been a pain for HP to do rolling upgrades of Cinder. Is the same issue there for PayPal? We are not fully capable of doing seamless upgrades right now, but we have workarounds to live with that. The third bullet is that, right now, if a DB transaction fails, there is no way to reconcile it. I think that problem may be solved if all the projects adopt a mechanism like TaskFlow to turn each sub-job into a task, into a transaction, and not just a DB transaction but a transaction at the OpenStack API level, so we can proceed, retry, fail, or roll back. We are not there yet; hopefully in six months we can make some progress. The final bullet is that we expect consistent API response times, consistent latency. If one request has much higher response time or latency than the others, do we or do we not take that specific OpenStack service out of the VIP, out of the HA pool? Do we perform an HA action if response time spikes or something similar happens? These are our open questions. Some of these issues we will have to rely on the OpenStack community to solve in each project, and some of them may need efforts outside the OpenStack code base. So that's it; that's how we do HA at PayPal, and this is the whole PayPal cloud team.

Yeah, so that eye chart of a flowchart, I posted it to my Twitter feed yesterday. If anybody has any feedback on it, please send it; we're going to try to contribute that diagram and keep it updated. As for anybody on the team here, feel free to reach out to us. A bunch of us have Twitter feeds, and cloud at paypal dot com gets you to the front door for contacting us. We're happy to work with anybody here over the next couple of days, as well as back in the U.S., and we have an international presence everywhere. And we can take some questions if anybody has any; we'll be around for a little bit longer.
So the question is how much memory we put on our compute nodes and how much of it is used on average. If you take a look at this, we put 256 GB on every compute node, and we have used up to 240. If we build a bare-metal VM, we use 240, because the rest is consumed by other things we run. So we don't oversubscribe RAM; we choose not to do that. On CPU we consolidate, depending on our workloads, anywhere from one-to-one up to four-to-one. Our QA testing environment is around four to one; in production we try not to get any higher than one and a half to one, virtual to physical. So that works out to... a lot of our workload is the equivalent of four physical cores, so it's maybe seven to one, six to one. On average I think it's around six VMs per compute node, if you normalize the environment. That's purely due to their production size; yes, we're trying to increase that, for various reasons, and as we move to Java we'll be able to go both smaller and bigger.

On HA of the network services, Nova Network and Neutron: today we rely on VIPs for the front end of those services, and then we rely on Nicira as our solution and follow their best practices for the service nodes, the gateways, and the other components to make sure they fail over appropriately within that network. Anybody else? All right. Feel free to grab us any time. Thanks very much.