So, this presentation is about our experience at eBay building a private cloud. It will be a two-part presentation. I'm JC Martin, a Cloud Architect at eBay, and I'll present the reasons why we built a private cloud and the components we extended in OpenStack to make it work. Then Subu, who is leading our engineering, will talk about why that's not enough.

So the main reason why we built a private cloud at eBay is agility. You can go back to the analyst conference presentations made by our executives in 2013, and you can see that every one of them listed accelerating innovation as the primary goal for eBay. You can also look at our CTO's numbers: the rate of innovation doubled between 2011 and 2012. What this means is that the number of features developers were able to push to the site doubled during that period. And that means we had to improve developer agility, so our developers can develop code and push it very quickly to the site. So the first high-level goal for us was to enable agility.

The second reason is that we wanted to build an infrastructure that could support all the various parts of eBay. You may think of eBay as one single entity, but it's made of many different adjacencies, or companies that we acquired over time, and each of them has either their own environment or a portion of our data center dedicated to them. They might even have their own ops team and their own practices to manage the site. In addition, each of those companies or entities has its own sub-environments: one for QA, one for their secure applications, one for their production applications, and so on. You can imagine that if you had to build separate infrastructure for each of those environments and companies, it would get out of control very quickly. So the goal for us was also to provide an infrastructure that could be shared between all those entities, with the same level of isolation and agility they were used to having with their dedicated environments, plus more efficiency by consolidating everything.

So what we did is build something that really looks like a public cloud, except that it's running inside our data centers. That means it's frictionless: you don't need approval tickets or anything to get an account, start deploying VMs, and deploy your application. It's isolated and multi-tenant: if you are an adjacency, for example PayPal, and you want to run your application on our cloud, you need the guarantee that you're isolated, that you are in control of your application, and that you can prove you are the only one controlling that infrastructure. And we wanted the experience of a public cloud, which means you don't see the seams between different environments. It's one large cloud that can accommodate the requirements of all those companies, not many different clouds dedicated to each of them. So it really looks like a public cloud.

The result of that is, from May 2013, you can see the growth of users, and you can see that at some point the number of projects increases faster, because we gave users the ability to create multiple projects to deploy their applications in multiple different environments. You can also see the distribution of users across all those different entities that I talked about.
So the big one is eBay; the smaller one next to it is PayPal. And then there are all these other entities that are starting to use our infrastructure. You can imagine it's a lot of flexibility for them to have this shared infrastructure instead of deploying and managing their own dedicated infrastructure. The number of VMs created since May is 100,000 VMs created and deleted, and you can see that the curve is like a hockey stick. It's kind of scary, because in terms of capacity and pressure on the team, it's starting to be quite huge. And this is the number of cores: every week we are adding machines. In order to build a real cloud, you cannot tell users, stop provisioning VMs, we are running out of resources. You cannot do that. So we have to make sure we have an efficient pipeline, so that we can bring machines into the cloud every week to adapt to that demand. I think we are at about 15,000 cores in that environment as of last week.

So what did we have to do in order to build a multi-tenant cloud on top of OpenStack? If you look at the diagram, the blue components are entities that already exist in OpenStack. On top of the infrastructure, we define regions; that's a concept that already exists in OpenStack. But on top of it, we needed to define availability zones. An availability zone for us is similar to what Amazon defines as an availability zone: it allows people to define isolated environments where they know that faults will not impact the other environments. A VPC is really the same concept as what Amazon calls a VPC: a dedicated cloud for one of those entities. It's a coarse-grained multi-tenancy concept that allows, for example, GSI or another entity to have their own self-managed virtual private cloud, and they will either give it directly to their users or use another mechanism to deliver those VMs or projects to their users. And on top of it, you have projects, the typical OpenStack project concept that you know about.

So when we double-click into those, what does it mean? We have large-scale networks that are shared across all those projects, and we abstract the network and partition it using virtual networks. We have multi-tier storage, or will soon have it, as soon as we deploy the Cinder component that supports multi-tier storage, and it can be consumed as either block or object storage. And we use commodity hardware, because we also need a cost-effective cloud. We have the same constraints as a public cloud: we don't have to make revenue, but we at least have to be cost efficient. On top of that, we have different flavors, from flavors for front-end types of workloads all the way to HPC types of workloads, where our analytics team is running some of its workloads. We support Windows and Linux images, and we have two styles of images: images that are community-provided, where we don't control what people put in them, and images that we customize in order to implement some best practices or integration with our infrastructure.

So on top of those concepts, we have the VPC concept, and each of those objects is mapped to a VPC. And we introduce the notion of class of service, which is a set of policies that you can apply on top of a VPC in order to give it a different behavior. With a VPC by itself, you don't really know what the security policies are; it's up to you to define them.
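Here's a minimal sketch of what that mapping could look like, with all names hypothetical rather than eBay's actual schema: resources carry a VPC assignment, a class of service bundles the policies applied on top, and images are tagged as either VPC-specific or generic.

```python
from dataclasses import dataclass, field

# Minimal sketch of the VPC metadata mapping described above.
# All names are hypothetical illustrations, not eBay's actual schema:
# the real implementation attaches metadata to OpenStack projects,
# networks, and images, and teaches the Nova scheduler to honor it.

@dataclass
class ClassOfService:
    name: str                      # e.g. "developer", "production"
    allow_corp_ldap: bool          # corp-LDAP login vs. SSH-keys-only
    internet_facing: bool          # DMZ, public-cloud-like VPC or not

@dataclass
class VPC:
    name: str                      # e.g. "paypal", "gsi", "developer-shared"
    cos: ClassOfService
    projects: set = field(default_factory=set)
    networks: set = field(default_factory=set)

def image_usable_in(image_meta: dict, vpc: VPC) -> bool:
    """Images are tagged either for a specific VPC or as usable anywhere."""
    tag = image_meta.get("vpc")    # the metadata key is an assumption
    return tag is None or tag == vpc.name

dev = VPC("developer-shared",
          ClassOfService("developer", allow_corp_ldap=True, internet_facing=False))
dev.projects.add("search-team-qa")
print(image_usable_in({"vpc": "developer-shared"}, dev))  # True
```

The VPC is the coarse-grained container; the class-of-service object is what gives it a behavior.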
Now, since we provide shared VPCs, for example for developers, we know what they are allowed to do and what they are not allowed to do. So on top of those VPCs, we define configuration templates, if you want, of policies that customize access control, cost, or even location in the data center. And in the OpenStack APIs and components, we add metadata, if you want, to the various resources that OpenStack offers, in order to map those resources to the notion of VPC. For example, a project for us can be assigned to a VPC: you create a VPC, and inside it you might have different projects, and each VPC has a different set of projects. You can assign a network to a VPC, so a network can be private to a VPC. If you have a developer-specific VPC, or a PayPal VPC, or a GSI VPC, which are entities in eBay, they will own that network and no one else will be able to get on that network, if they decide that's their policy. We had to customize the dashboard in order to allow users to select a VPC or availability zone, and we did some modification of the scheduler in Nova to allow the selection of networks, either based on capacity or on the tags that we put on them for VPC or class of service. And for images, since we customize images for integration with a specific environment, we also have to specify whether an image is specific to a VPC or would work in any VPC.

So I can explain a bit what kinds of VPCs we have. We have one VPC that behaves exactly like a public cloud. You create a project, you push your application into that VPC, and it's directly available on the internet. You can build some experience and immediately expose it to users. But it's completely in a DMZ: you cannot access any of the internal resources of eBay. It's really as if it were a public cloud available across the internet. But we also have a VPC that is like an extension of your desktop, so you can run a workload, for example for development or for testing an application you are currently developing. In that one, we have integration with our corporate LDAP, allowing users to log in to their VMs directly: our images integrate with corporate LDAP so that users can log in with their corporate credentials. But in the public version, the public VPC, they are completely shielded from the corporate network, so they cannot use their corporate LDAP credentials; they have to upload a key and use SSH keys to log in to their VPC. So that's the type of difference we had to build into our images, and then we have to tag those images with specific capabilities and specific compatibility with VPCs. So that's, at a high level, what we had to do. And I'm going to give the mic to Subu, and he's going to tell you why OpenStack is not a cloud.

So thanks, JC. I took over cloud development as chief engineer over the last year, and one of the first things I learned in my first few months of experience running a cloud was that OpenStack is not a cloud. A couple of months back, I wrote a blog post making the point that it is not a cloud, and most folks who have built large private clouds get it immediately: yeah, we know that. But if you ask folks who have built smaller clouds, less than 50 machines or hypervisors, that's not very clear. A lot of people see OpenStack as a cloud in a box.
You deploy something like DevStack, and boom, it's there. So the key difference is what our customers see, because at eBay, like JC said, we think and act like a public cloud. That is, all the APIs that OpenStack has are open to anyone, whether it's creating VMs, taking snapshots, bringing your own images, or writing your own cloud scripts, whatever. What they want is an API abstraction that lets them do what they want to do without permissions and controls. In two minutes, they want a VM; that's the most common use case. What they also want is cloud as a service. They don't want software. Our tenants don't want to deploy and maintain an OpenStack deployment. What they want is a service, like AWS, like the Rackspace public cloud.

So getting from here to there involves a lot of things that are totally outside the purview of OpenStack as a core. For example, even before getting OpenStack up and running, we spend time on network design and infosec, deal with firewalls, get the builds up and running, make CIDR changes, and onboard the infrastructure. Like JC was talking about capacity addition: how can we add capacity quickly? Can I add half a rack in 30 minutes? That's time we have to spend automating and building the tools. Config management: we spend a lot of time automating our Puppet infrastructure to push changes to production. That's not what customers see. What customers see is the blue circle; they want APIs. But what happens behind the scenes? Same thing with high availability. In order to get a good time to recovery, we have to invest a lot of time in log processing, metrics collection, and monitoring of the cloud, not the VMs but the cloud infrastructure, plus alerting and incident resolution: when things go wrong, how do you communicate with users? How do you deal with availability? And then, of course, user experience, so that it is seamless. We decided to expose a lightly customized version of Horizon, and we are in charge of all the interactions: customer support, spending time with developers to explain how to use it, how to get around, what the best practices are. And SLAs: defined SLAs, so that they can trust us.

So on this journey, what we noticed in the last year was that for the first six months, it was awesome. You get a VM in two minutes, or block storage of a terabyte or whatever size, and that was agility. Now what people want is availability of the cloud. So we have to set SLAs. Not just that, we have to build mechanisms to measure SLAs: how do we know that we are up five nines or four nines? In addition to that, upgrades. For example, last year, going from Essex to Folsom was a nightmare. We had to literally forklift each VM from an Essex cloud to a Folsom cloud. That's a lot of work for us. Capacity planning: how do we know that we are running low on capacity? How do we onboard capacity? How do we scale out remediation? Say a hypervisor is alerting that there is an issue with its memory or disk: how do we deal with that? Auto-scaling, metering and chargeback, monitoring, alerting: those are, again, features that customers want. But ultimately, what they want is a cloud service, not software. So that journey is quite fun. But you may not know that you have to go through all that when you start out. At 50 hypervisors, you won't see it. When you go beyond 50, to 100, 150, you start to see the pain, and the investments you have to make to run the cloud as a service, not as software.
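One concrete item from that list is measuring SLAs. As a rough, hypothetical illustration of what "five nines" actually implies, here's the arithmetic turned into a few lines of Python: measured availability from synthetic-probe outcomes, and the downtime budget implied by a given number of nines.

```python
# Minimal sketch (not eBay's actual tooling): turn synthetic-probe results
# into a measured availability number, and compute the downtime budget
# implied by an SLA target such as "four nines" or "five nines".

SECONDS_PER_YEAR = 365 * 24 * 3600

def measured_availability(probe_results):
    """probe_results: list of booleans, one per synthetic probe run."""
    if not probe_results:
        return None
    return sum(probe_results) / len(probe_results)

def downtime_budget(nines):
    """Allowed downtime per year for an availability target of N nines."""
    target = 1 - 10 ** (-nines)
    return (1 - target) * SECONDS_PER_YEAR

if __name__ == "__main__":
    # e.g. 10,000 provisioning probes, 3 of which failed
    results = [True] * 9997 + [False] * 3
    print("measured availability: %.4f%%" % (100 * measured_availability(results)))
    for n in (3, 4, 5):
        print("%d nines allows ~%.1f minutes of downtime per year"
              % (n, downtime_budget(n) / 60))
```

Five nines works out to roughly five minutes of allowed downtime per year, which is why the measurement mechanisms matter as much as the target itself.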
So we thought we could talk forever on each of these bubbles; there's a lot of work behind each of them. Instead, we picked some topics that we experienced in recent months and want to share our experience and learn from this conference. One is log processing and metrics: how we collect metrics and how we do alerting. Then, what the pattern for scale-out is, so that we provide availability for the business as well as a scalable cloud infrastructure. And finally, how we deal with things like builds and config management.

So let me start with monitoring. We approached this problem from two angles. When I started this effort, I was working alone as a developer on the team, and usually something like this happens. A user says: hey, I want to boot an instance of an image. So I run nova boot --image IMAGE_ID --flavor FLAVOR my-vm. Happy path: two minutes later, you've got a VM that has an IP, you can access it, things are good. When things go wrong, the user asks: is the cloud broken? And I get anxious. I log into each hypervisor, look at the scheduler logs and all that, to figure out: oh no, it's not broken, you did something wrong. That's fine when you have five developers using your cloud, but when you have 50 or 5,000, things change. You have to get from that question to the answer very fast, because that controls your time to recovery.

So let me dive into that specific use case. It's just one use case, but the most common one. When you create a VM, let's say you go through Horizon, the request goes to the Nova API. Then the scheduler gets the request through RPC. The scheduler finds the host, and it does network selection, because we do network selection based on the class of service. The request goes to Nova compute. It gets a port, which in our case means talking to the NVP controller; we use NVP for the virtualization of the network layer. The next step is that Nova compute gets the image from Glance, sets up KVM, the VM boots, cloud-init runs, and everything else happens.

Now, there are many points of failure in this. For example, RabbitMQ might choke. Just ten days back, we had an outage that started at eight in the morning, and we didn't know why provisioning was failing. After a while, we figured out: oh, it's RabbitMQ, one of the nodes is dropping messages, we have to do something about it. Capacity issues: oftentimes the scheduler dumps a warning into a database table called instance_faults because there's no capacity, the hypervisors ran out, or the network ports ran out. Or our Quantum plug-in failed. Or Glance is down, or it timed out for some reason. In fact, we had one case where, without thinking much, we deployed a load balancer in front of the Glance APIs. It was choking; it was throttling our traffic for Glance images, so provisioning was failing.

Now, the thing is, at the point when Nova hands off to libvirt, it sets the VM state to active. You look at the dashboard, or nova show, and it says green. Awesome. But the user says: I can't log in. Because some things happen after that. For example, the VM might not have gotten DHCP, because again there was a delay between, let's say, Quantum and the DHCP agent, or there was some other failure there. Or the metadata call timed out, because cloud-init goes through the metadata API to download, let's say, the SSH keys.
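To make those failure modes concrete: a VM that Nova reports as ACTIVE can still be unusable. Here's a minimal sketch of an end-to-end boot check, assuming a python-novaclient client object `nova` and hypothetical image and flavor IDs; it treats ACTIVE as necessary but not sufficient.

```python
import time

# Minimal sketch of an end-to-end boot check, assuming a python-novaclient
# client object `nova` and hypothetical IMAGE_ID / FLAVOR_ID values.
# The point: "ACTIVE" in Nova only means the hand-off to the hypervisor
# succeeded; DHCP, metadata, and cloud-init can still fail afterwards.

IMAGE_ID = "..."   # hypothetical
FLAVOR_ID = "..."  # hypothetical

def boot_and_verify(nova, timeout=300):
    server = nova.servers.create("probe-vm", IMAGE_ID, FLAVOR_ID)
    deadline = time.time() + timeout
    while time.time() < deadline:
        server = nova.servers.get(server.id)
        if server.status == "ERROR":
            raise RuntimeError("scheduling/provisioning failed")
        if server.status == "ACTIVE":
            break
        time.sleep(5)
    else:
        raise RuntimeError("timed out waiting for ACTIVE")
    # ACTIVE is not enough: verify the guest actually came up
    # (got an IP via DHCP, reached the metadata API, ran cloud-init).
    ips = [ip["addr"] for nets in server.addresses.values() for ip in nets]
    if not ips:
        raise RuntimeError("VM is ACTIVE but has no IP address")
    return server, ips
```

A real probe would go further and actually ping the address and attempt an SSH handshake, since that is what the user ultimately cares about.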
So you may actually get a brick from Nova. In the early stages last year, we were literally logging in to each machine to figure out what was going on, and we had to change that. The way we approached it is very straightforward. Each component in OpenStack does a good job of logging, except that they all log in different, inconsistent ways. We use Logstash, which is an extremely well-written, flexible log processing framework, to transport log messages into Elasticsearch, where we then use Kibana. Kibana is a dashboard built for Logstash on Elasticsearch, and we use it to search the logs.

The main thing we had to do here is the grok patterns. Logstash takes what are called grok patterns that you can inject. Let's say I want to process Nova API logs: you know the structure of what the API log looks like, and based on that, you write a pattern to extract the different fields, like log levels, timestamps, messages, and whatnot. And they end up in the Elasticsearch cluster. The beauty of it is that it's horizontally scalable. Right now we process a few terabytes of log messages every month, and usually it takes less than five seconds for a log message to appear in Kibana from the time it was logged. So if I run nova show, within five seconds I have the trace of that request in Kibana. The outcome was that if I know the request ID of a command, let's say nova boot, I can trace where it went, from the Nova API to the scheduler to the hypervisors. Awesome. We then spent another sprint a few months back to extract more data out of the logs: we started collecting response codes and latencies from all the APIs. I think the picture is too small here, but these are response codes and latencies for one of the APIs. Now, week over week, I can see if latency increased or the rate of error response codes increased, and I can track that.

So now we are able to get all this data, and this is great, but often that's not actually what I want. What I want is metrics. Logs are great: when I know something is wrong, I can go back and debug. But as an operator of a cloud, I want to see metrics. I want to do monitoring so that I can set thresholds and get alerted when things go wrong. Of course, the pager does its job; we can react to errors. So what we did was take what we built for log aggregation and introduce StatsD, which is a small Node.js program open-sourced by Etsy a couple of years ago. What it does is count things, like error codes in the last 60 seconds: if we see a higher rate of errors, we can count that. We send that to Graphite; Graphite serves as the time-series store. And we use Zabbix for alerting. We started with Nagios, but we were not happy with the experience, mainly because we didn't find it as programmable as Zabbix, so we switched to Zabbix and then expanded it to all the infrastructure: all the OpenStack nodes as well as the NVP infrastructure, and even the block storage. We aggregate all of that into Zabbix. So we have lots and lots of data. In fact, we have too much data now: too many graphs and too many alerts, which can actually flood us.
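Here's a minimal sketch combining the two pieces just described, with illustrative names rather than our actual configuration: a grok-style regular expression for an OpenStack-format log line, and a counter pushed to StatsD over its plain UDP protocol.

```python
import re
import socket

# Minimal sketch of the two ideas above. In production the parsing lives
# in Logstash grok patterns, not Python; names and the sample are illustrative.

# OpenStack services log lines roughly like:
# 2013-11-05 12:34:56.789 12345 INFO nova.api.openstack.wsgi [req-abc...] ...
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<pid>\d+) (?P<level>[A-Z]+) (?P<logger>\S+) "
    r"\[(?P<request_id>req-[0-9a-f-]+)[^\]]*\] (?P<message>.*)"
)

def parse_line(line):
    """Extract structured fields from one log line, like a grok pattern."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

def statsd_incr(bucket, host="127.0.0.1", port=8125):
    # StatsD wire protocol: "<bucket>:<value>|c" for a counter, over UDP.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(("%s:1|c" % bucket).encode(), (host, port))

line = ("2013-11-05 12:34:56.789 12345 ERROR nova.compute.manager "
        "[req-3f2e... ] Instance failed to spawn")
fields = parse_line(line)
if fields and fields["level"] == "ERROR":
    statsd_incr("nova.compute.%s" % fields["level"].lower())
```

StatsD then aggregates the counters over its flush interval and forwards them to Graphite, which is what makes thresholds and alerting possible.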
Still, we had an issue: if a user asked, is the cloud broken, we had gotten better, but the answer was still a maybe. We were never quite sure if the cloud was working the way it's supposed to work. So we had to go back and do some more work, and we looked at other sources of information. If you look at the database, let's say scheduling failed: the Nova database has a table called instance_faults that has the complete exception stack trace. That's useful. If you count those exceptions and there's a higher rate, you know there's a problem. So we wrote something we call stack-metrics (sketched below), which is a pretty dumb Python program that periodically reads the database tables and counts things: how many users there are, how many projects there are, what the VM count is, how many VMs are in a non-active state, or, for example, the Nova instance task states, like creating, scheduling, deleting. And you can say that if an instance is in a task state longer than maybe a minute, there is an issue: it's stuck. So we started counting all those things and publishing to Graphite and Zabbix.

There's one more thing. We actually get some false negatives and false positives out of this model. Sometimes there's no alert, maybe because nobody's using the cloud at the time: it's a weekend, the cloud is down, nobody's using it, but maybe there was a problem. Or false positives, because a user made a mistake: let's say somebody started using a new image that has a bug in it, and provisioning is failing across all their projects. To avoid those, we wrote something we call stack-watch, which is a small bot that looks at all the KPIs we want to track as a cloud operator, like provisioning VMs, creating volumes, ping success rates, ping latencies. We simulate those user actions; they are controlled actions, so we can measure the outcome. We do that and, again, publish the data to Graphite and Zabbix. With that, if there's an incident where the cloud is not responding or provisioning is failing, we pretty much know before users know it. We get the logs, we get the alerts. In addition, we use the same data to present that information to end users, so they can answer that question on their own. And it's like a wall of fame or a wall of shame: if the numbers on that status board on the right side are good, it's a wall of fame; if they're bad, it's a wall of shame, pushing us to improve. So this journey took about two or three iterations over a period of a year, and we got into a much better state as we scaled; we were able to react to incidents much faster. So that is what we did for logging and monitoring.
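Here's a minimal sketch of the stack-metrics idea. The table and column names follow Folsom-era Nova (instances.task_state, instance_faults) and should be treated as assumptions, and SQLite stands in for Nova's MySQL database.

```python
import socket
import sqlite3  # stand-in; the real thing would connect to Nova's MySQL DB
import time

# Minimal sketch of "stack-metrics": periodically count interesting rows
# in Nova's database and push the counts to Graphite. Table and column
# names follow Folsom-era Nova and are assumptions; the SQL date function
# is SQLite's (MySQL would use NOW() - INTERVAL 1 MINUTE).

QUERIES = {
    "cloud.vms.total":
        "SELECT COUNT(*) FROM instances WHERE deleted = 0",
    "cloud.vms.non_active":
        "SELECT COUNT(*) FROM instances WHERE deleted = 0 AND vm_state != 'active'",
    # "stuck": in a transient task state for more than a minute
    "cloud.vms.stuck":
        "SELECT COUNT(*) FROM instances WHERE deleted = 0 "
        "AND task_state IS NOT NULL "
        "AND updated_at < DATETIME('now', '-1 minute')",
    "cloud.faults.total":
        "SELECT COUNT(*) FROM instance_faults",
}

def collect(conn):
    now = int(time.time())
    return ["%s %d %d" % (metric, conn.execute(sql).fetchone()[0], now)
            for metric, sql in QUERIES.items()]

def publish(lines, host="graphite.example.com", port=2003):  # hypothetical host
    # Graphite plaintext protocol: one "metric value timestamp" per line, TCP 2003
    sock = socket.create_connection((host, port))
    sock.sendall(("\n".join(lines) + "\n").encode())
    sock.close()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE instances (deleted INT, vm_state TEXT, "
                 "task_state TEXT, updated_at TEXT)")
    conn.execute("CREATE TABLE instance_faults (id INTEGER PRIMARY KEY)")
    conn.execute("INSERT INTO instances VALUES (0, 'active', NULL, DATETIME('now'))")
    print("\n".join(collect(conn)))  # publish(collect(conn)) in the real loop
```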
The next topic I would like to talk about is scale-out. As JC was pointing out, we took the public cloud as a principle in our minds, even though we are operating a private cloud. And there's always a tension. As we keep adding hypervisors, the infrastructure grows, and when the infrastructure grows, things start to behave differently: RabbitMQ, network gateways, things like that. So we have to find a way to split the cloud into multiple clouds. But if you have more than one cloud, you don't have a cloud, you have a problem, because users start to see multiple clouds, and that's not a good user experience. There is also a tension with availability: if you have one single cloud, all your eggs are in the same basket; if you have multiple clouds, users get to split their applications across different availability zones.

So we looked at the OpenStack patterns, like availability zones. If anyone recalls, in Essex there's something called an availability zone, which was an unfortunate choice of name at that time. It probably made sense for a simple VM cloud, but it no longer makes sense, because it does not let you create availability zones like you can, for example, on AWS. Keystone has this notion called regions, which is a way to group API endpoints. And Nova has the notion of cells, from Grizzly onwards. So we did a mapping exercise between what OpenStack provides and what we need from the cloud, and we looked at AWS. The AWS model is very straightforward, people understand it, and it strikes a good balance between availability of the application and scale-out: as a user, you see maybe 20-odd availability zones in a single cloud, so you still get that single-cloud experience, but you are able to spread your application across availability zones.

In OpenStack, it was a bit tricky initially, but after some analysis we found the right pattern for our business. On the right, you see all the data plane components: hypervisors, block storage, network drivers, and storage. On the left side, you see the control plane for that. There are two things we had to consider. One is that in OpenStack, unfortunately, the line between the control plane and the data plane is not always clear, depending on your architecture. For example, if I take my Neutron or Quantum API down, is the cloud working? If you're using DHCP, probably not: leases time out, leases don't get renewed, and your cloud is down. So that line is not very clear. Same thing for Swift; of course, with Swift it's understood, it's in the data path, it has to be available. So there are these dependencies. The other thing we considered is that if I am creating infrastructure in an availability zone, I need to know which availability zone I'm creating that infrastructure in. So there's a strong coupling between the API endpoints and the data plane infrastructure behind those APIs.

So we drew a boundary to define what an availability zone means for us. This is not the Nova availability zone; this is our model of an availability zone, again very similar to what Amazon does on AWS. We drew a line around the APIs and the data plane. It's a very simple model, easy to explain to users, and we mapped it to the concepts that exist in OpenStack. So that's how our model looks. We have three regions. Again, a region here is not what you see in Keystone. Our definition of a region is an independent deployment in a data center area, which may have multiple buildings spread apart, with maybe a few milliseconds of latency between them. And the choice we made is that each region is fully independent: if there is a failure in a region, it's totally isolated from the other regions. Then, further, we break each region down into multiple availability zones, and an availability zone includes all its services and the data path behind those APIs. Each availability zone is mapped to what Keystone calls a region. So we had to do some mental mapping between what exists in OpenStack and what we need, to arrive at a multi-region experience for our customers, making a trade-off between availability and user experience.
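To illustrate that mental mapping: in this model, each of our availability zones shows up as a Keystone region, so grouping the service catalog by its region field recovers the zone layout. A minimal sketch, assuming a Keystone-v2-style catalog structure with made-up values:

```python
from collections import defaultdict

# Minimal sketch of the mapping described above: each of "our" availability
# zones appears as a Keystone region, so grouping the service catalog by its
# "region" field recovers the AZ layout. The catalog structure follows the
# Keystone v2 format; names and URLs are made up for illustration.

catalog = [
    {"type": "compute", "endpoints": [
        {"region": "us-west-az1", "publicURL": "https://nova-az1.example/v2"},
        {"region": "us-west-az2", "publicURL": "https://nova-az2.example/v2"},
    ]},
    {"type": "volume", "endpoints": [
        {"region": "us-west-az1", "publicURL": "https://cinder-az1.example/v1"},
    ]},
]

def endpoints_by_zone(catalog):
    zones = defaultdict(dict)
    for service in catalog:
        for ep in service["endpoints"]:
            zones[ep["region"]][service["type"]] = ep["publicURL"]
    return dict(zones)

for zone, services in endpoints_by_zone(catalog).items():
    print(zone, "->", services)
```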
So the final topic is builds and deployment. This is basically the journey of what happened when we started the cloud. Initially, we started with Ubuntu and Essex, doing apt install. Using Fabric, we sort of set up the cloud: 50 hypervisors, awesome, it was painless. But as we grew, in terms of scale as well as team size, that started falling apart. We then invested in Puppet infrastructure and Foreman, and we were able to create pet test clouds: a few test clouds where we can roll out changes and test them. But what we found was that relying on a public repository like APT was not good enough for us, because things break and we have to make fixes. Sometimes we have to pull down changes from the next version of OpenStack, like we had to pull changes from Grizzly into Folsom. Same thing last week: for Quantum, we had to pull down some changes from trunk. So that model started breaking down.

At the last design summit in Portland, we talked to the Rackspace folks, found out what they were doing, learned from their experience, and expanded what we were doing. We invested in three things. First, a bare-metal provisioning layer for the entire topology, from hypervisors to controllers to the NVP infrastructure. Second, we ditched APT packages and went to Python virtual environments, which we use to create completely self-contained tarballs that serve as the deployment unit (see the sketch below), still using Foreman and Puppet. So today we are able to bring up a test cloud, a complete working cloud, in a few hours, and we can test things in that cloud. So from pet clouds, we went to cattle clouds. That also enabled us to do patches and partial upgrades: we don't always stick to the same release of OpenStack; we sometimes mix and match based on features that exist in certain components.

We still have some problems with what we are doing. One is Puppet: as we know, change orchestration is hard. At scale, oftentimes we want to roll out changes to a subset of hypervisors, or a subset of controller nodes, or one part of the infrastructure, and with Puppet that's not working out very well. So we are looking at alternatives, in addition to Puppet, like SaltStack or something else, to orchestrate change. Likewise, the Foreman UI worked out well when we were a small team, but doing click-ops, by which I mean going to the UI to make state changes and config changes, doesn't seem right anymore, so we are looking at alternatives there as well.

In this process, what we ended up building was a full topology orchestration that works out of the box. We are working on open-sourcing that complete automation, including the log aggregation, Elasticsearch, the Zabbix alerts, and all the tools that go with it, and we are working on getting it out. For the Zabbix templates, we worked with our colleagues at PayPal to build a good set of templates to monitor everything from the hypervisors to the NVP infrastructure to SolidFire, which we use as a backend device. As for the tools that we wrote and the work that went into creating VPCs, JC and I are thinking about how to make them part of the community, so that we don't have to work around the core. Today we are working around the core: for example, we had to extend scheduling to get what we want, network selection, or managing projects in each VPC. So there are some gaps that we found, and we want to contribute the scale-out patterns that we learned over the last year.
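On the virtualenv deployment units mentioned a moment ago, here's a minimal sketch of the idea; the paths, package pins, and layout are illustrative, not our actual build scripts.

```python
import subprocess
import tarfile

# Minimal sketch of the deployment-unit idea described above: build a
# Python virtualenv with pinned packages, then ship it as one tarball.
# Paths and package pins are hypothetical illustrations.

BUILD_DIR = "/tmp/nova-venv"
REQUIREMENTS = ["nova==2013.1.4", "python-novaclient==2.15.0"]  # hypothetical pins
ARTIFACT = "/tmp/nova-venv.tar.gz"

def build_venv():
    subprocess.check_call(["virtualenv", BUILD_DIR])
    pip = BUILD_DIR + "/bin/pip"
    subprocess.check_call([pip, "install"] + REQUIREMENTS)

def package():
    # One self-contained tarball = one deployment unit; Puppet/Foreman
    # only need to unpack it and flip a symlink on the target host.
    with tarfile.open(ARTIFACT, "w:gz") as tar:
        tar.add(BUILD_DIR, arcname="nova-venv")

if __name__ == "__main__":
    build_venv()
    package()
    print("deployment unit at", ARTIFACT)
```

The design choice is that the artifact carries all of its Python dependencies with it, so deployment no longer depends on a public package repository being correct at install time.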
That brings me to my last slide. Thank you. Any questions? Yes, please.

So the question was: this looks like a complicated change to OpenStack. What I want to say is that we didn't have to tweak or fork OpenStack to make this work. Thanks to the community, the extensibility models that exist in Keystone, Nova, Glance, WSGI, and so on made it possible for us. Our core is still the same as what you see on GitHub; it's no different, except that we introduced code through extension points, like scheduler host selection, network selection, or Keystone project mapping, things like that. It was not hard at all; it just took a while for us to find the right mapping for the concepts.

The next question was: there are too many components in this; isn't that complicating your topology? I think there is no choice, because when you're running a service, not just software, you have to have the support mechanisms and support tools to run it efficiently. Our commitment to the business is availability and efficiency, and we can't get there with just native OpenStack. And there's no out-of-the-box solution anyway, so we found that we had to build it ourselves. I think we have to leave the room now. Thanks.