OK, good afternoon, everyone. My name is Trini Alexander, and I'll be talking about powering up the eBay OpenStack cloud and what goes on under the hood. I've been with eBay for about seven months, and there's been a lot of learning; the massive scale of eBay is absolutely incredible.

Let me quickly walk you through the agenda. I want to talk about cloud architecture and how you build for scale, what goes into cloud lifecycle management, how we bootstrap a cloud, how we add capacity, and also some of the future directions we're thinking about.

Before I begin, I want to talk about eBay and PayPal, and how demanding the business is. eBay has about 157 million active users globally, and PayPal is present in 203 markets worldwide. Both have a lot of transactions happening, which requires a highly scalable, dynamic cloud. The eBay/PayPal OpenStack cloud roughly looks like this: in its entirety, we have over 12,000 hypervisors, and that keeps growing on a daily basis. We have 300,000 virtual cores, 70,000 VMs, and over 1.6 petabytes of provisioned storage. We have more than 10 AZs and 15 VPCs. We're currently on the Havana release, but we're in the process of upgrading to Juno.

All right, so I wanted to talk about the cloud architecture a little bit, some of the fundamental design principles. Our cloud at eBay has different regions, and each region has different AZs. We plan for failure: we want to distribute the workloads rather than keep everything in one region, and if we want to scale horizontally, we add more regions and more AZs. That's the fundamental principle. The key thing is planning for failure. Things could go wrong at the AZ level or at the region level, or there could be loss of connectivity between regions or between AZs. So we plan for failure and follow a shared-nothing model. Most services are shared-nothing, but there are a very select few services, like Keystone, where we have a shared model.

So how do we look at cloud lifecycle management? There are a lot of moving pieces under the hood, and fundamentally we've grouped them into four big functions: provision, deploy, monitor, and remediate. Provision is basically bootstrapping the cloud: setting up your regions and AZs, deploying all the OpenStack services, and on top of that, the tenants' workloads. Deployment is upgrading your control plane, doing software rollouts, config changes, et cetera. And of course, if you have a cloud running at scale, you want to monitor its health, and if it's deteriorating, you want to remediate almost in real time. If I expand these four bubbles, the first one is the cloud buildout, and the others are the ongoing software deployments, monitoring, and remediation.

So what I'd like to do is go through the pre-step of building a cloud: how do you bootstrap a cloud? We have a process called rack onboarding. There's a lot of pre-work that goes into building a cloud, a lot of planning in terms of which data center, which AZ, which network bubble, which rack location, et cetera. These are all planned out. Then the gear lands on premise, gets wired up and powered on. At that point, there's a whole bunch of automation that happens to get that rack onboarded nicely into the system.
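Before getting into the onboarding flow itself, here is a minimal sketch of the region/AZ, shared-nothing layout just described. All names, URLs, and the shape of the config are illustrative assumptions, not eBay's actual setup; the one detail taken from the talk is that most services are deployed per AZ while a select few, like Keystone, are shared.

```python
# Illustrative sketch of a region/AZ topology with a shared-nothing model.
# All names and the structure are hypothetical, not eBay's actual config.

# Each AZ runs its own copies of most OpenStack services (shared nothing),
# so an AZ or region failure does not take down the others.
TOPOLOGY = {
    "us-west": {"azs": ["us-west-az1", "us-west-az2"]},
    "us-east": {"azs": ["us-east-az1", "us-east-az2"]},
}

# Services deployed independently in every AZ (shared nothing).
PER_AZ_SERVICES = ["nova", "cinder", "glance", "neutron"]

# A select few services, like Keystone, follow a shared model instead.
SHARED_SERVICES = ["keystone"]

def service_endpoints(region: str, az: str) -> dict:
    """Return the (hypothetical) endpoints a workload in one AZ would use."""
    endpoints = {svc: f"https://{svc}.{az}.example.com" for svc in PER_AZ_SERVICES}
    # Shared services resolve to a single global endpoint.
    endpoints.update({svc: f"https://{svc}.example.com" for svc in SHARED_SERVICES})
    return endpoints

if __name__ == "__main__":
    print(service_endpoints("us-west", "us-west-az1"))
```

The design point is simply that losing one AZ's endpoints leaves every other AZ's set intact, with Keystone as the deliberate exception.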
It all flows from a ticketing mechanism. A bootstrap ticket gets filed with the rack profile, the assets, the SKU information, et cetera. When the rack is powered on, all that information is uploaded: there's a PXE boot, and a custom image that we download onto all the assets does a scan of the system, collects all the infrastructure- and asset-level information, and uploads it into a central repository. We call it a payload. Since the bootstrap ticket has the rack profile, asset, and SKU information, we cross-check against it to make sure the right SKU landed on premise and that we onboard it correctly.

We also do a whole bunch of checks. We do an LOM check to make sure the BMC is accessible, so you can do out-of-band management. We set up the network interfaces and the relationships between VLAN, subnet, the bubble, et cetera. We also calculate the asset's fault domain. Once that's done, we go and update our internal CMDB with all the asset and rack model information. At that point, we're pretty much done with the rack onboarding, but we do one final step of spinning up a test compute instance to make sure the assets are really functional.

Another thing I wanted to call out: we only qualify a rack as onboarded if more than 90% of its assets are working and functioning. By the way, if any of these steps fails, we automatically file a ticket, which gets assigned to a team that works with the vendor and fixes the faulty hardware. Sometimes we get faulty hardware assets trying to get onboarded, and they're stuck in the queue; the key thing is we catch it early, detect, and remediate quickly. This process is completely automated once the rack is powered on. It used to take weeks to get done; now, with the automation, it's down to a few hours. That's an incredible savings.

The next thing I want to talk about is that once you onboard the rack, all these assets go through a certain lifecycle, all the way from ordered. Ordered basically means you've ordered the asset and are working with the vendor to get it on premise. Once it lands on premise, it goes into something called cold cache. Once it's onboarded through the previous process, it's in the warm cache, basically ready to be consumed. Once assets are allocated, they obviously go into the allocated queue. And last but not least, there are assets that reach end of life, where we want to do a tech refresh; those go into decom. There's one more state I haven't called out in the workflow: faulty. Between warm cache and decom, an asset could go into a faulty state for multiple reasons, so we have that state as well; there's a sketch of this lifecycle as a small state machine below.

So what does this get us? We can look at our entire cloud and figure out which assets are in which state, so it gives us a view into the system. This is some sample data I cut out, showing counts across ordered, cold cache, warm cache, allocated, and faulty. The key thing is that this gives us the insight to operate our cloud much more efficiently in terms of asset utilization. If things are stuck in cold cache, it's indicative that our onboarding process is not efficient.
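Here is a compact sketch of the asset lifecycle just described, modeled as a small state machine. The states and the "faulty can be entered after onboarding" rule come from the talk; the code itself, including the exact transition table, is an illustrative assumption rather than eBay's tooling.

```python
# Illustrative state machine for the asset lifecycle described above.
# States come from the talk; the transition table is an assumed sketch.
from enum import Enum

class AssetState(Enum):
    ORDERED = "ordered"        # ordered from the vendor, not yet on premise
    COLD_CACHE = "cold_cache"  # landed on premise, not yet onboarded
    WARM_CACHE = "warm_cache"  # onboarded and ready to be consumed
    ALLOCATED = "allocated"    # serving workloads
    FAULTY = "faulty"          # failed a check after onboarding
    DECOM = "decom"            # end of life / tech refresh

# Allowed transitions; faulty sits between warm cache and decom.
TRANSITIONS = {
    AssetState.ORDERED: {AssetState.COLD_CACHE},
    AssetState.COLD_CACHE: {AssetState.WARM_CACHE},
    AssetState.WARM_CACHE: {AssetState.ALLOCATED, AssetState.FAULTY},
    AssetState.ALLOCATED: {AssetState.FAULTY, AssetState.DECOM},
    AssetState.FAULTY: {AssetState.WARM_CACHE, AssetState.DECOM},
    AssetState.DECOM: set(),
}

def transition(current: AssetState, new: AssetState) -> AssetState:
    """Move an asset to a new state, rejecting moves the lifecycle forbids."""
    if new not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new

if __name__ == "__main__":
    state = AssetState.ORDERED
    for nxt in (AssetState.COLD_CACHE, AssetState.WARM_CACHE, AssetState.ALLOCATED):
        state = transition(state, nxt)
    print(state)  # AssetState.ALLOCATED
```

The payoff of pinning the states down like this is exactly the reporting described next: every asset is in exactly one state, so counting assets per state across the fleet is trivial.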
If things are in the faulty state and that count is expanding, it shows you have a lot of infrastructure but you're not utilizing it efficiently. Same thing with warm cache: if that capacity gets depleted, it's a trigger to the capacity team to onboard new capacity into the warm cache, or move assets from cold cache to warm cache. So this basically provides operational efficiency, operating the cloud in a cost-optimal way and reacting to faulty assets very quickly. It gives us the data transparency and the data confidence to operate our cloud better.

The next thing I want to talk about is the capacity add process. We have racks onboarded, and we want to add them into the respective OpenStack services. We have internal tooling, again fully automated, which provisions the hypervisor and adds it to the various subsystems: registering with Foreman and Nova, doing the network onboarding, and testing the newly provisioned hypervisor, basically by provisioning a test VM and checking for IP pings, reachability, et cetera. Only once all those steps pass do we enable the hypervisor. The key thing is that we don't want to enable it until it's fully onboarded, because otherwise you'd have a half-onboarded hypervisor taking workloads, and it gets into a messy state. This, again, is a fully automated process; it used to take days, and now it's done within a few minutes. There's a reverse process as well: off-boarding. Sometimes assets and hypervisors drift and get into a state where we want to remove them from the system. If you roughly reverse the previous process, you get the off-boarding process. Same thing: fully automated, operates within a few minutes.

So that's a bit of the cloud bootstrapping. Now I want to jump into things we do on an ongoing basis, starting with capacity management. We have a lot of VMs that users create and spin up; basically, anyone at eBay with a badge can spin up a VM and do their work. The key thing for us is to make sure all those VMs are effectively utilized ("regulated" is probably the wrong word for it). So we look at the utilization of all the VMs, and if a VM is underutilized based on certain triggers, we create leads. With the leads generated, we work with the VM owners: we send them an email saying, hey, this is underutilized; would you like to delete it, extend the tenure, or do nothing? A delete, of course, deletes the VM. But if they do nothing within 14 days, we shut down the VM and give a grace period of about seven more days before we delete it, in case some important work would be lost because the employee is on vacation or whatever. And by the way, when we send the email, we send it to the VM owner as well as their direct supervisor, just in case folks are out. This is an ongoing, fully automated process (you can see that's pretty much a theme here), and to date we have reclaimed about 15,000 VMs. That's a huge deal. It's basically looking at your usage and trying to run your cloud at optimal capacity and with the best efficiency. A sketch of this reclamation timeline follows below.

All right, so the next thing I want to talk about is how we monitor the cloud. If you look at the four bubbles from earlier, one of them is monitoring.
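Before getting into the monitoring details, here is the sketch of the reclamation timer promised above. The 14-day notice and roughly 7-day grace period come straight from the talk; the Lead structure, names, and action strings are illustrative assumptions.

```python
# Minimal sketch of the underutilized-VM reclamation timeline described above.
# Timings match the talk (14-day notice, ~7-day grace); everything else,
# including the Lead class and action names, is a hypothetical sketch.
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

NOTICE_PERIOD = timedelta(days=14)   # owner has 14 days to respond
GRACE_PERIOD = timedelta(days=7)     # VM is shut down, then held ~7 more days

@dataclass
class Lead:
    vm_id: str
    notified_at: datetime                  # email to owner and their supervisor
    owner_response: Optional[str] = None   # "delete", "extend", or no response
    shutdown_at: Optional[datetime] = None

def next_action(lead: Lead, now: datetime) -> str:
    """Decide what the reclamation job should do with this lead right now."""
    if lead.owner_response == "delete":
        return "delete_vm"
    if lead.owner_response == "extend":
        return "close_lead"                  # tenure extended, nothing to do
    if lead.shutdown_at is None:
        if now - lead.notified_at >= NOTICE_PERIOD:
            return "shutdown_vm"             # no response within 14 days
        return "wait"
    if now - lead.shutdown_at >= GRACE_PERIOD:
        return "delete_vm"                   # grace period expired
    return "wait"                            # owner can still reclaim the VM

if __name__ == "__main__":
    lead = Lead("vm-123", notified_at=datetime(2015, 5, 1))
    print(next_action(lead, datetime(2015, 5, 16)))  # shutdown_vm
```

The shutdown-before-delete step is the safety valve: a powered-off VM is loud enough that an owner on vacation still gets a week to notice before anything is destroyed.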
We use Zabbix internally to monitor our OpenStack services. We monitor hypervisors, OpenStack services, and also Zabbix itself, per AZ: we want to make sure that if the Zabbix service in one AZ goes down, it's detected and remediated, so there's cross-AZ Zabbix monitoring as well. Once the monitoring detects an issue or an incident, there's a capability to alert, and we have various levels of alerts. There's PagerDuty for high-criticality alerts, which is basically what wakes up somebody on the cloud ops team; email if it's high but not critical; and if it's a medium or lower alert, it just files a ticket.

Why do we need alerting? It's pretty obvious: we want to monitor the health of the cloud and keep it humming along nicely, and if there are issues, we want to detect them very quickly and react fast. One thing I want to call out from our experience: we went a little crazy creating all these alerts, and a lot of them were not really actionable. So you want to be very careful when you build out alerts. Make sure each alert is actionable and meaningful, and if you have an alert, you should have an equivalent runbook that lists everything you do when that alert fires. If you don't have a runbook, you really have to question what the purpose of that alert is. One thing we did around the tickets: all the lower and medium alerts just file a ticket; they don't surface in email or on anybody's PagerDuty. That contains the volume very nicely.

On the Zabbix architecture, I'm sure there's a ton of material out there on the Zabbix site, but I just want to call out the pieces: Zabbix agents, Zabbix server, Zabbix database, and Zabbix frontend. Agents are installed on all the hypervisors; they collect information from the various hosts and feed it into the Zabbix database. The server monitors the operating range: if a signal falls outside that range, the item goes from OK to problem state, or from problem back to OK, based on whether the signal is good or bad. And there's an alerter service that takes the necessary action: it could file a ticket, send an email, or send out a PagerDuty alert.

All right, moving on. Another thing we do is collect information from across our AZs and across various systems. We have Zabbix, for example, collecting metrics from the hypervisors; we have all the Nova, Cinder, and Keystone databases; and we have our own internal CMDB and user databases. We can pull all that information together and build analytics on top of it. Think of it as an ELK cluster: we pull the information from all the various AZs into one central Elasticsearch cluster, and from there we can cut different dashboards for investigating things, plus dashboards we watch on a regular basis, like an alerts dashboard or a compute provisioning dashboard. The power of this is you can take an alert incident and pretty much drill down: sometimes you have a stream of alerts coming in, you go to the alerts dashboard, and you can find out which region it's coming from, which AZ, which hypervisor, down to the VM level. That's pretty powerful in terms of quickly reacting to an incident and getting to the bottom of it.
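To illustrate that kind of drill-down, here is a hypothetical Elasticsearch aggregation that buckets the last hour of alerts by region, then AZ, then hypervisor. The index name, field names, and endpoint are assumptions; the talk does not describe the actual schema.

```python
# Hypothetical Elasticsearch query illustrating the alert drill-down described
# above: bucket a burst of alerts by region, then AZ, then hypervisor.
# Index, endpoint, and field names are assumed, not the real schema.
import json
import urllib.request

ES_URL = "http://elasticsearch.example.com:9200/alerts-*/_search"  # assumed

query = {
    "size": 0,  # we only want the aggregation buckets, not raw documents
    "query": {"range": {"@timestamp": {"gte": "now-1h"}}},  # last hour's alerts
    "aggs": {
        "by_region": {
            "terms": {"field": "region"},
            "aggs": {
                "by_az": {
                    "terms": {"field": "az"},
                    "aggs": {
                        "by_hypervisor": {
                            "terms": {"field": "hypervisor", "size": 10}
                        }
                    },
                }
            },
        }
    },
}

def top_alert_sources() -> dict:
    """POST the aggregation and return the region -> AZ -> hypervisor buckets."""
    req = urllib.request.Request(
        ES_URL,
        data=json.dumps(query).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["aggregations"]

if __name__ == "__main__":
    print(json.dumps(top_alert_sources(), indent=2))
```

A dashboard on top of the ELK stack renders this same nesting visually; the query just makes the region-to-AZ-to-hypervisor drill-down explicit.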
The next bit I want to cover is cloud infrastructure remediation. This is the part where infrastructure goes bad: assets, hypervisors. How do you react to it, and how do you handle it? At eBay's scale, you have to have automation in place to manage this effectively. What we've put together is remediation software that has a whole bunch of automated remediation jobs that run on alerts, plus an ops console where the cloud operations team can do everything they would otherwise do directly on the hypervisor from the tooling itself. For example, they can look up an asset by asset ID or FQDN and click a cog that reboots the hypervisor; all those various actions can be performed. There are also reports on all the jobs being performed: a lot of nightly and daily jobs get executed, and you can see their status in a nice report console.

There's another thing we recently added. Oftentimes we get into the situation of migrating workloads. That could be because a hypervisor is running too hot and you want to move certain workloads off of it, or you're dealing with a tech refresh, or there are specific use cases like some we recently had to deal with within eBay. We've been doing this a lot, so we automated the whole process of migrating workloads: basically flexing up new VMs onto new hypervisors and flexing down the ones running on the source hypervisor. That helps; for example, if a hypervisor is running hot, migrating workloads off of it cools it down. We do that primarily because we want to be DR-ready: we want all hypervisors operating within a certain range and not above it.

So hypervisors go down; how do you handle that? We want near real-time detection and near real-time remediation when hypervisors go down. We have the monitoring system checking against signals, and if a hypervisor is not reachable, we flag the remediation engine to go remediate the hypervisor-down issue. But we want to be careful here: if there's a site-wide incident, like a network glitch, you don't want to react to those signals. So we have a quiet period before we alert the remediation engine to do the hypervisor-down remediation.

What do we do as part of the remediation? The first thing is to find all the VMs running on the bad hypervisor and notify their owners: your VM went down because of a hypervisor-down issue, and here's the best practice for setting up a highly available VM cluster. For production applications, we already have automation that spins up replacement VMs without the application owner even knowing it; basically, they'll see a dip in capacity, but before they know it, it's replenished. We also have tooling to collect diagnostics from all the hypervisor-down cases, so that we can learn from patterns and later on do predictive eviction of VMs from those hypervisors: if we see a certain pattern leading to a hypervisor-down state, we can notify the VM owners and ask them to evict before the hypervisor has issues. This is another fully automated process, and the key KPI here is reducing the time to detect and the time to remediate.
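Here is a minimal sketch of the quiet-period gating just described, including the mass-failure cutoff that comes up again in the Q&A. The durations, the threshold value, and all names are illustrative assumptions, not eBay's actual parameters.

```python
# Minimal sketch of hypervisor-down detection with a quiet period, as
# described above. The durations and the mass-failure threshold are
# illustrative assumptions; the talk names the ideas but not the values.
import time
from typing import Dict, List, Optional, Set

QUIET_PERIOD_SECS = 600        # how long a host must stay unreachable (assumed)
MASS_FAILURE_THRESHOLD = 20    # suspiciously many hosts dark at once (assumed)

first_seen_down: Dict[str, float] = {}   # hypervisor -> when it first went dark

def hosts_to_remediate(unreachable: Set[str],
                       now: Optional[float] = None) -> List[str]:
    """Return the hypervisors to remediate this cycle, applying the quiet period."""
    now = time.time() if now is None else now

    # Hosts that recovered before the quiet period expired were false alarms.
    for host in list(first_seen_down):
        if host not in unreachable:
            del first_seen_down[host]

    # Start (or continue) the clock for hosts that are currently dark.
    for host in unreachable:
        first_seen_down.setdefault(host, now)

    # A large batch going dark at once looks like a network glitch or a
    # site-wide incident, not real hypervisor failures: hold off entirely.
    if len(unreachable) > MASS_FAILURE_THRESHOLD:
        return []

    # Only hosts unreachable through the whole quiet period get remediated.
    return [h for h, t in first_seen_down.items()
            if now - t >= QUIET_PERIOD_SECS]
```

Hosts that survive the quiet period would then be handed to the remediation engine, which does the owner notification and production respin steps described above.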
We were doing this over days: it would take about three or so days to detect a hypervisor-down issue and remediate it. Now it's near real time. It could be fully real time if we didn't have the cooling-off period, but we opted for those quiet periods so that we don't accidentally react to false alarms.

The next thing I want to talk about is the remediation workflows, all the automated jobs in our remediation engine. On the left side, you have the various lifecycle stages of the hypervisor: getting the asset bootstrapped, provisioned, and operating. Then there are the ongoing pings, hardware checks, and monitoring, plus decom; sometimes the operator even looks at an asset and manually flips it to a bad state. The thumbs-down basically marks a hypervisor or asset that's in a bad state, and that goes through auto-remediation. We have various pools: there are assets we fully operate and remediate, and there are pools exclusive to a specific group who want to manage their own assets, where we act as a pass-through and just notify those folks. The idea is, if there's a faulty asset, we do the VM migration (we call it the faulty migrate), get all the VMs off the hypervisor, and get it back to a healthy state. The thumbs-up means the remediation job succeeded. Not all hypervisor issues are remediable, though; there are times when we have to work with the vendor, file an automated ticket, and get the issue rectified and taken care of. This whole thing involves a lot of steps and a lot of passing the buck from one team to another, so automating it has definitely helped: it's gotten our faulty assets under 1.5%. We used to run at greater than 5% faulty assets in the system, and getting that down to 1.5% (it's actually much less than that today) means much better utilization of your assets and huge cost savings.

The next thing I wanted to touch upon: we have all this tooling and automation, but are we done? What's our future direction? We have a ton more work to do. We've automated a lot, but there's a lot more that needs to be done. We want to get cloud lifecycle management to almost zero-touch: we should be able to run the cloud fully automated, a self-healing cloud, where if there's an issue with the control plane, it sends a signal to the remediation engine, the remediation engine kicks in, repairs it, and brings it back to the normal state. All those pieces still need to be built; there are a lot of bits we're still missing. Another thing we're looking at is containerizing the OpenStack services, and not just containerizing them but also managing the container clusters; we're evaluating Kubernetes at this point. We're looking forward to building the next-generation cloud. There's a lot of work and a lot of challenges: working with cutting-edge technologies like Kubernetes, looking at fully automated cloud operations. We're definitely excited about this journey, and there's a lot more that needs to be built. So if there's a builder in you who's excited about this scale, feel free to reach out to us; we're looking for great people. Thank you very much. I'll take questions. There's a mic up there, if you don't mind walking up to it.

I have two questions.
Looking at this model, what capabilities do your tenants actually get out of the platform? Because as the service provider, you're taking all the actions when a failure happens. What capabilities do tenants who are just hosting their workloads on this platform get from it? And what are the compelling drivers for them to migrate to this platform?

So it's the eBay cloud, and elasticity is the key thing with the cloud model, whether that elasticity is provided at the provider level or at the tenant level. You can spin up a compute instance, spin up a VM, do your test project, and fold it once you're done. The cloud model enables that. The value to the tenant is that, without going through ten levels of infrastructure procurement, you can directly go and spin up an instance and do your testing.

Do they have to rewrite their code to do that? That's what I'm asking. The major problem that I see from my perspective is, if I have to migrate my existing applications onto OpenStack to use these capabilities, the APIs have to be very intelligent there, so tenants may have to rewrite their code to call those APIs. Or it can be provided at the service provider level.

Like I said, you're running in a cloud, and you can do whatever is needed for your application; that's all on the tenant. What we do is enable the cloud operations. If you want to spin up a VM, we use OpenStack: Nova, with Cinder as the block storage, and we give you all the front-end services to get you enabled.

My second question is about what capabilities you provide to your tenants so that they know the health of their infrastructure, and at the same time the health of their lifecycle and the health of their security. Do you provide any such capabilities from the cloud for your tenants?

We're looking at building such services; that's in the works. We're looking at creating a telemetry equivalent, so we can look at all the stats from the VMs, and you can subscribe to certain things and get alerted based on them. Again, that's on the data plane side. The monitoring I described here was on the control plane, the infrastructure: the layer between the hardware and the OpenStack services.

Thank you. You're welcome.

Thank you. Could you talk about the automation tools you use? And one question there: can you confirm that, as it looks, you send everything to one single Elasticsearch cluster for all of your infrastructure? That sounds amazingly large. And what are those indexers?

So it is a single Elasticsearch cluster. A lot of the tooling is homegrown, and that's where I think we can partner with the community and make it better. I'm sure a lot of folks are running into the same issues, so let's solve these problems together. I would love to have more participation there.

Hi, two questions. The first one's hopefully easy. You have the cool-off period to avoid systems being shut down if you get a false positive. Do you have the means to go in and stop that from happening if the problem has escalated beyond it?

Yes. Yes, I forgot to mention that; thanks for asking. Even if it's not a site-wide incident but a lot of hypervisors are going down (we rarely have that issue), we have a threshold: if it crosses more than X number of hypervisors, we flag it and stop the hypervisor-down remediation process.
And we want to be extra cautious, because if you react to a false alarm, it could cascade.

Second question: I know eBay spends a lot of engineering resources on heat and power management and the conservation of both. Do you do any work with that team to be able to spin servers up and down, or to better consolidate services onto hypervisors that are underutilized? Have you gone to the full step of automating the system toward those energy and heat conservation goals?

Not right now, but that's a great opportunity we can look into.

OK, thank you. Just one question: could you share a bit of detail on how you made your decisions in storage-centric areas, like what kind of storage you use (you mentioned Cinder), which of the services you use, and so on?

That's slightly off topic, but maybe we can chat offline?

Sure. Hi, you mentioned regions and AZs at the beginning of the presentation. Did you mean OpenStack regions and OpenStack AZs, or something else?

Let's say it's a cloud concept. You want to have multiple regions; regions could be like US West, US East. And within each region, you could have multiple AZs that are kept separate, so that if one of the AZs gets impacted, your workloads running in another AZ keep functioning, because each AZ has a separate network backhaul, separate power supply, everything. That's the region/AZ model we're trying to describe. OpenStack has a concept of a region and a concept of an AZ as well.

(The mic is not on.) ...the Amazon- or Google-style model, where an AZ is a fairly large failure domain whose control plane includes everything from compute to the network. So I understood that what you presented is not the OpenStack concept of a region?

Exactly. Right. Exactly. All right, any other questions? Thank you.