Welcome, all. Good afternoon. In this session we're going to talk about cloud operations at scale. I'm Satish; I'm part of the team that operates and supports OpenStack clouds across Walmart. Hi, good afternoon, all. I'm Tom; I'm a senior cloud engineer at Walmart Labs, working out of California. Hi, I'm Gerald Butalo; I'm a senior manager at Walmart Labs, and I manage the cloud operations team that operates all the OpenStack clouds across Walmart, as well as a middleware team. So this afternoon we're going to talk about how we operate OpenStack clouds at scale at Walmart Labs.

We started by building a few proof-of-concept (POC) clouds powered by OpenStack. We use OneOps at Walmart; OneOps is our orchestration layer for all our VMs, so we had OneOps spinning up VMs and managing the lifecycle of the applications in the clouds, and we had analytics built up as well. We then built the prod and non-prod clouds. We didn't go large scale initially; it was just a few clouds to begin with. Our first customer was walmart.com, so if you go to walmart.com today, you're on an OpenStack-powered cloud. As the walmart.com applications came on, the requirements increased. First it was daily traffic, which went well. Then it scaled up and we took on holiday traffic as well. Holiday came and we got over 1.5 billion page views, in our very first year, with millions of transactions. As most of you know, CDN solutions take care of caching for page views, but the transactions, cart, checkout, search, browse, are all applications talking to each other within the cloud. It went really well. It was awesome.

But building those initial clouds came with lessons and challenges, and in our day-to-day cloud operations we have run into quite a few of them. We'll brief you on some. First, limitations on bulk operations: when we ran a Rally performance test to benchmark our cloud, we saw boot requests failing beyond a certain threshold. Handling bulk requests is a prerequisite for scaling the cloud to take on more application workloads. Next, the Open vSwitch version we were running in our Havana clouds would crash and impact the VMs hosted on the server, causing turbulence for the application VMs. We used to get support requests and pages for this, and it took considerable operational effort to fix. Then, we all know the control plane is still a pet for us; we cannot treat it as cattle the way we treat the compute nodes. So if a network glitch or a hardware failure hits a control plane host, we're going to have an issue, at least a momentary one. Consider the case where a network glitch or a hardware failure on a controller drives the RabbitMQ cluster into a partitioned state.
That sort of behavior destabilizes the cloud: the overall message flow suffers and the OpenStack services start flapping. Another scenario we've seen is a high number of queues getting created and just left orphaned, with no consumers to drain the messages. That in turn causes slowness in the message flow, which causes a chain of slowness in other OpenStack service requests, and the unacknowledged message counts keep climbing. We've seen this behavior interrupt our cloud's message flow and create overall slowness.

Another issue we want to talk about is MySQL slave lag. On the Havana version of OpenStack, whenever there was a network glitch or a hardware failure on a controller host, we saw the replication lag between the databases increase, and some cases ended up in a split-brain situation. Whenever we hit split-brain, it took a lot of work to resync the databases with each other.

Another thing concerns the hypervisor auto-disable feature in Juno. When there's a hardware or software issue on a compute node, we expect the hypervisor to show up as disabled when we run nova service-list. But it wasn't working as expected: it caught some software failures, but not hardware failures.

Also, on Havana we saw the dnsmasq process getting killed, multiple times. Whenever that process died, the network host files ended up in inconsistent states, which meant duplicate IP issues, and we had to do manual work to unblock the customer.

And recently we faced an issue with CLOSE_WAIT connections in the nova-api container. The glance client connections were not getting closed, so the connection count kept increasing, and once it reached a limit, all the glance client connections started failing. First it fails on one controller, then at some point it moves to the next controller and fails there, and at the end of the day the Nova boot API requests are failing.

These are some of the challenges we faced along our cloud journey. If you notice the highlights on the slides, you'll see Juno and you'll see Havana. We were on Havana first, and last year we upgraded to Juno, so we're sharing challenges we've seen in both. That last one, the CLOSE_WAIT issue, was a hard one: as the CLOSE_WAIT connections fill up, the controllers stop accepting requests one by one, and then that particular region, that particular cloud, is not serving a single request. Nothing works anymore, and all of a sudden your customers are asking what's happening. We faced that one very, very recently. There's a patch for it in Juno, but we're waiting to apply it once we're back.
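The early-warning check for that one is simple to script. Here is a minimal sketch of the kind of per-process CLOSE_WAIT monitor we mean, assuming psutil is available on the controllers; the threshold is illustrative, not the value we run with:

```python
import psutil

# Count CLOSE_WAIT sockets per process so we can alert before a
# controller's nova-api stops accepting requests entirely.
THRESHOLD = 500  # illustrative; tune to your controllers

def close_wait_counts():
    counts = {}
    for conn in psutil.net_connections(kind="inet"):
        if conn.status == psutil.CONN_CLOSE_WAIT and conn.pid:
            counts[conn.pid] = counts.get(conn.pid, 0) + 1
    return counts

for pid, count in close_wait_counts().items():
    if count > THRESHOLD:
        try:
            name = psutil.Process(pid).name()
        except psutil.NoSuchProcess:
            continue
        print(f"ALERT: {name} (pid {pid}) has {count} CLOSE_WAIT sockets")
```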
Moving along: we built the clouds and we looked at the challenges. As we all know, it's people, process, and technology. Once you have the people and the process in place, the technology you can build, develop, or buy from the market. But you've got to have your people and process in place first.

So what we decided is: let's follow the Scrum model. Let's look at what's breaking, what our alarms are, what our incidents are, and keep reviewing continuously. We made a priority list, looked at what was making the loudest noise, prioritized it, and went to fix it. We look for fixes within the community first; if there aren't any, we fix it ourselves and contribute upstream. We kept sprinting, fix, review, fix, review, until we reached a state where we felt the clouds were stable. Once we were at a stable state, that was good enough to move forward. Until then, we wanted to reduce the noise, but at the same time we didn't want to put in any Band-Aid fixes, because a Band-Aid fix just comes back to haunt you later: you patch one thing here and something else breaks. Our main aim was to get to the root cause, work hard at it, find it, fix it, and move along. We didn't want our engineers waking up in the middle of the night or early in the morning and everybody coming to work cranky. So we really took the time to look at what went wrong, fix it, and move along. And now Satish will talk about some of the fixes we put in.

For any challenge we face in our clouds, the first thing we check is whether the community has hit the same challenge and has a fix for it. If not, we raise a bug on Launchpad, assign it to ourselves, develop a fix, and push it back upstream. One of the challenges from the previous slide was bulk API operations failing beyond a certain threshold. During a Rally test we found that when we bumped up the concurrency, the failure rate started increasing. We started analyzing and hadn't seen any matching issue reported in the community, so we created a bug and assigned it to ourselves. We were able to trace it to the database connection pool size and overflow values, which were left at their defaults. Once we tuned those values, we achieved the desired results. This helped us accommodate more application workloads: we can now bump up the concurrency as our business needs grow.
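For reference, the knobs in question are the oslo.db connection pool options in nova.conf. A hedged example; the values below are illustrative placeholders, not the numbers we settled on:

```ini
[database]
# Defaults were too small for bulk boot concurrency in our Rally runs.
max_pool_size = 30    # SQLAlchemy connection pool size
max_overflow = 60     # extra connections allowed beyond the pool
pool_timeout = 60     # seconds to wait for a connection before erroring
```

Rally makes the change easy to verify: rerun the boot scenario at the higher concurrency and watch whether the failure rate stays flat.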
Another issue we found was disk space problems on the nodes. When we looked into it, the OpenStack service logs, the container service logs, were not rotating at a proper interval and were not getting compressed. So as usual we went to Launchpad, saw there was an open bug for it, picked it up, and fixed it for Juno. The fix was simple: any container service log can now be rotated and compressed, for any type of container. After that, we don't get nearly as many disk space issues.

Another thing we found was the high number of capacity requests coming in every day. We have a separate team in our org called the capacity planning team, and they work with the application owners; ideally, every application owner files their capacity request with the capacity planning team. That team analyzes the request, looks at the application's business projections, checks the current utilization of that assembly or application, and finally approves the capacity. Once it's approved, they file a ticket to us saying: allocate capacity for this application across this number of clouds. As OpenStack engineers, we would have to log into every cloud and execute the CLI commands. Instead of doing that, we developed a client that loops through each region and updates the capacity; the same client can also create tenants and users. With this we empowered the capacity team to execute the capacity allocations themselves, right alongside the approval process, which reduced the number of tickets coming to us and reduced the wait time for the customer.

As ops people, we work with a lot of CLI commands in our day-to-day life, and whenever we spot bugs, we report them on Launchpad and try to fix a few of them ourselves. On the previous slide we showed the CLOSE_WAIT challenge; that bug had already been reported by the community, and it turned out to be an issue with the glance client version we use in our cloud, so we were able to pick up that code, deploy it, and verify that it works. So the first two avenues are contributing fixes to the community and picking up fixes that are already available there; if a challenge doesn't fall into those two buckets, we build automation around it to overcome the challenge. For example, for the RabbitMQ unacknowledged message challenge we mentioned, we built automation that invokes the RabbitMQ management API, collects the metrics as analytics, and trends them: unacknowledged messages, socket descriptors, all the critical RabbitMQ parameters. By trending these, we can catch the symptom of an issue in its initial stage and fix it before it aggravates into a major impact on the cloud.
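Concretely, the collector is little more than polling the management API. A minimal sketch; the host, credentials, and thresholds here are placeholders:

```python
import requests

# Poll the RabbitMQ management API, trend per-queue unacknowledged
# counts, and flag orphaned queues before the message flow slows down.
BASE = "http://rabbit.example.com:15672/api"
AUTH = ("monitor", "secret")  # a read-only monitoring user

resp = requests.get(f"{BASE}/queues", auth=AUTH, timeout=10)
resp.raise_for_status()

for q in resp.json():
    unacked = q.get("messages_unacknowledged", 0)
    # Queues with messages but no consumers are the orphans we saw
    # dragging down the overall message flow.
    if q.get("consumers", 0) == 0 and q.get("messages", 0) > 0:
        print(f"orphan queue: {q['name']} ({q['messages']} messages)")
    if unacked > 1000:  # illustrative threshold
        print(f"high unacked: {q['name']} ({unacked})")
```

In our setup the numbers feed the analytics pipeline for the daily and weekly trend reports rather than going straight to print.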
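The multi-region capacity client mentioned earlier has a similarly simple shape. A sketch against the Juno-era python-novaclient, where the regions, credentials, and quota numbers are all placeholders:

```python
from novaclient import client as nova_client

# Apply an approved quota change for a tenant across every region,
# instead of logging into each cloud and running the CLI by hand.
REGIONS = ["prod-east-1", "prod-west-1", "prod-central-1"]
AUTH_URL = "http://keystone.example.com:5000/v2.0"

def allocate_capacity(tenant_id, cores, ram_mb, instances):
    for region in REGIONS:
        nova = nova_client.Client("2", "admin", "password", "admin",
                                  AUTH_URL, region_name=region)
        nova.quotas.update(tenant_id, cores=cores, ram=ram_mb,
                           instances=instances)
        print(f"{region}: capacity updated for tenant {tenant_id}")

allocate_capacity("TENANT_UUID", cores=200, ram_mb=512000, instances=100)
```

Handing this to the capacity planning team, wrapped in their approval workflow, is what took the tickets off our queue.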
So far, what we've seen is that we built clouds that were not really large in nature. We acknowledged that there were challenges, went and fixed them, got our process in place, got our people in place, got the clouds to a smooth operational level, an acceptable level at least, and scaled from there. Once we got to that stable state, we went ahead and scaled up and built larger clouds; we have over 20 clouds now. Again, OneOps is used for the orchestration, we have our data analytics in there, and we built more production and more non-production clouds. There's walmart.com; we added Sam's Club, we added Asda, and next week Walmart Canada will be going in as well. So that's four e-commerce websites on these clouds, with a lot of traffic: the daily traffic, the holiday traffic, the millions of hits we're getting, and now over 3,500 applications on these clouds. These clouds took the holiday traffic last holiday, and we had really good availability for all the markets. So it's pretty much proven that you can do it: if you take it step by step and keep it simple, you can move application by application onto the cloud and it works really well.

As we expanded, we went from a few racks to a lot of racks of gear, but what stayed consistent is at the bottom: the number of associates managing those few racks stayed the same when we scaled up. What we looked at is: how do you scale up without simply throwing people at support? The big question is how you keep operating all these clouds, production and non-production, at high availability. For us that's absolutely critical, because we have live customer traffic from all over the world hitting our websites, plus other internal applications and even a few databases on these clouds. So how do we achieve this high efficiency? A very simple technique: self-healing. All of us want to do self-healing; everybody wants to get there, just like us. But then the questions come: what do you want to self-heal? What do you want to automate? Nova services? Neutron? RabbitMQ? The hypervisors, the VMs? The next question is how you want to do it: simple and easy, or sophisticated? And the bigger question is what you go after first. These are all equally important components, so how do you prioritize?

Our self-healing philosophy is very simple. We all do monitoring, and we all get notifications from it. What we did is, instead of having the notification come to us, we defined a set of workflows and put in whatever script is needed to fix the known issues. Let's take an example: we have process monitoring for the nova-compute process. The monitor checks whether the process is running; if it isn't, it triggers an alert, and on that alert we defined a workflow: restart the service, then check the Nova service list to confirm the status shows up and enabled, and check the process again. If it looks good, the self-healing succeeded. But we never stop at that point; a process restart doesn't mean the issue is fixed. We treat it as buying us time, room for improvement, and we keep working on what actually needs fixing. For that, we have scheduled analytics: automation collects the metrics and gives us daily and weekly reports on how frequently each issue happens and how critical it is, plotted from the alert counts alongside what got self-healed. This kind of monitoring lets us see what types of issues we have; we've set up monitoring and automations that give us all the critical OpenStack data we need to keep the OpenStack services stable.
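Here's what that nova-compute workflow can look like as a script, assuming a systemd-managed service named openstack-nova-compute and admin CLI credentials on the box; the unit name and timings are assumptions, not our exact setup:

```python
import subprocess
import time

# Self-healing workflow for the nova-compute process alert: restart
# the service, then verify that both the process manager and the Nova
# API agree it is back up and enabled.
def heal_nova_compute(host):
    subprocess.run(["systemctl", "restart", "openstack-nova-compute"],
                   check=True)
    time.sleep(10)  # give the service time to re-register with Nova

    result = subprocess.run(
        ["nova", "service-list", "--host", host, "--binary", "nova-compute"],
        capture_output=True, text=True, check=True)
    if "enabled" in result.stdout and " up " in result.stdout:
        print(f"{host}: nova-compute healed")
    else:
        # The restart did not recover the service; page a human.
        print(f"{host}: self-heal failed, escalating")
```

The restart itself never closes the loop; the alert and its outcome still land in the daily and weekly reports so the underlying issue gets fixed.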
The key thing we went after: we started graphing this data, what was important to us, analyzing it, looking at the thresholds, the spikes and dips, the peaks and valleys. Based on that, you can prioritize what to go after first: whether you need to automate it, whether you need to fix it, whether there's a legitimate issue there. Graphing the data enabled us to build a solid priority list. Then we went after simple techniques, like using event triggers from the many monitoring solutions out there in the market today. Those event triggers saved us a lot of time on simple issues with Nova services, Neutron services, and so forth. After we got those out of the way, we decided: okay, how far do we want to go? Do we want to create workflows? There's a lot of software out there where you can visually create workflows, create events, create webhooks, and start chaining self-healing steps: check service A, if it's down check service B, run a workflow, then loop back and start service A. We're looking into those as well right now. But in short, you can operate large-scale clouds just by keeping it simple, and that's what we're trying to show here.

Now, OneOps at scale within Walmart. We use OneOps for orchestrating the VMs on all our clouds. In the earlier slides we spoke about monitoring and self-healing for all the OpenStack components, but I'm assuming everybody's thinking: hey, there are VMs and all the other stuff inside as well. We'll get to some of that later, because OneOps takes care of the self-healing there. OneOps, which enables orchestration of our clouds and application lifecycle management in them, has over 4,000 applications today; over 4,000 associates across Walmart are using OneOps, with over 40,000 deployments in 30 days, and that number keeps going up and down. OneOps helps automate over 150,000 cores, because OneOps spins up the VMs: the user logs into OneOps, chooses the flavor, large, medium, small, two-core, four-core, depending on what memory he needs, and spins up the VMs. He doesn't have to log into OpenStack or hit Horizon; he just goes through OneOps. OneOps can spin up VMs in Azure, Rackspace, and of course our own clouds. The 50 circuits you see at the bottom right are the packs, or the assemblies, already available within OneOps, and OneOps is open source at oneops.com. So if you download OneOps and want to orchestrate your cloud, these apps are already there and the circuits are ready; you install your app, bring your code, and you can get rolling.

Now we're going to talk a little bit about the design of OneOps. As Gerald said, OneOps is an orchestrator that manages the lifecycle of VMs hosted in any cloud. OneOps gives us continuous lifecycle management of application workloads, and it's a multi-cloud orchestrator: we can run OneOps against any cloud, an OpenStack cloud, an Azure cloud, whatever kind of cloud, and it manages our applications' design, development, deployment, operations, and monitoring. OneOps also provides a self-service portal where the application owner can administer and monitor their application, get notifications whenever alerts fire, and act accordingly. So where do we configure our cloud?
There's an account where we set up a cloud provider, and the cloud provider can have multiple services: whatever our service requires, a load balancer, DNS, compute, storage, all of that is set up in the account. Above the account there's an org, and an organization is analogous to an OpenStack tenant or project. OneOps streamlines three phases of the lifecycle: design, transition, and operation. Design is where we define an architecture based on our application requirements. For example, take Tomcat: Tomcat requires a set of software before you install it, you need Java, a user, and a volume if you want to create a mount. That whole set of software you can consider a pack; OneOps gives you a list of predefined packs you can pick up for your design, where you visually assemble your application and see how it's going to be deployed in a VM. Then comes transition. Transition is where we realize the design: we decide which cloud it needs to be deployed to, and we can use the same design to deploy to multiple environments by customizing a few things, maybe a heap config, or the VM size or flavor, across different environments. Transition also lets us select the availability mode: whether the application should be deployed as a single VM per cloud or as redundant multiple VMs per cloud. Then operation. Operation is where we monitor and perform actions on our deployed application VMs: if you want to restart, stop, or start an application or a compute, all of that can be done in the operations section. Operations also gives more abilities, features that work hand in hand with OpenStack and let us perform operations at scale; there are some important features Tom will get to.

One of the interesting features OneOps provides an OpenStack engineer is self-healing of the compute components, called auto-repair and auto-replace. What are auto-repair and auto-replace? Say in my OpenStack environment one piece of hardware fails and the VMs on it are no longer in a healthy state. OneOps runs a set of probes, so it knows the instance is unhealthy. Immediately OneOps takes that instance out of live traffic and triggers a set of procedures. In this case it triggers an auto-repair procedure: internally it tries to restart the application inside the VM, and if that doesn't heal it, it reboots the VM. If after the auto-repair procedures the instance still isn't healed, it replaces the VM, meaning it takes the VM off the failed node and hosts it on a new physical node. With this, we don't need to take any immediate action on failed hardware; we get time to look into the failed hardware and fix it permanently instead of just resetting the machine to bring the hypervisor back. So as Tom mentioned, the VM lifecycle, auto-repair, and auto-scale are taken care of by OneOps.
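The escalation reads roughly like this. An illustrative sketch of the policy as just described, not OneOps's actual implementation; the probes and actions are hypothetical stand-ins:

```python
import random

def is_healthy(vm):
    return random.random() > 0.4  # stand-in for the OneOps probes

def restart_application(vm):
    print(f"{vm}: restarting application")

def reboot_vm(vm):
    print(f"{vm}: rebooting VM")

def replace_vm(vm):
    print(f"{vm}: re-hosting on a new physical node")
    return vm + "-replacement"

def heal(vm):
    # The instance is already out of live traffic; escalate one step
    # at a time until it comes back healthy.
    restart_application(vm)   # auto-repair, step 1
    if is_healthy(vm):
        return vm
    reboot_vm(vm)             # auto-repair, step 2
    if is_healthy(vm):
        return vm
    return replace_vm(vm)     # auto-replace; triage the hardware later

print("back in traffic:", heal("vm-42"))
```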
So while we look at self-healing on the OpenStack side, Nova, Neutron, and the other components down the stack, the VMs are what matter to the application teams and to the business. You can lose a rack, you can lose a bunch of nodes, and OneOps takes care of repairing or replacing them, which is key to keeping the websites up, keeping the applications up, and even to scaling. The impact of a failure decreases because you spread the VMs across various regions of various clouds: you can lose an entire cloud and still be fine, because you're running in another three or four regions, depending on how you've picked your data centers and spread your application. This lets us absorb hardware and other failures and even do a deep triage without disrupting any application, any traffic on the websites, or any customer. OneOps also enforces policy; it has the security policies built in, and as mentioned earlier, you can deploy to multiple clouds. So with that, we're at the end of the presentation. Any questions? We've been told you have to walk up to the mic for questions, or you can even contact us via email.

Hi, thanks for the presentation. I'm Neil from cloud engineering at Box, and I was wondering if you could go into a little detail about the people part of the journey, from a DevOps perspective. How did the cloud migration and the operations side of it take DevOps into account?

Sure. That was a long journey for us. It took a lot of workshops and a lot of understanding between the engineering teams, the cloud teams, and the development teams. Basically, once you migrate your application into the cloud, you become self-serviced, self-enabled. Going backwards: on bare metal, an application developer develops his application, puts it on the bare metal, and you may have a different team that supports it, plus multiple teams for systems, network, database, and so forth. Once you go to the cloud, for us with OneOps, the application developer develops his app and makes the pack, or if one's already there, he leverages the existing circuit or pack, pushes his code through dev, stage, QA, performance testing, and into production. Once it's in production, they are the ops team too. We went to the model of: you build it, you own it. And that's how the teams work now, a proper DevOps setup where you're developing, operating, and managing your own app. Initially it took time, but over a period of time people understood: hey, you're in self-service mode, you own your app, you're the owner, you get to see everything, your CPU, your network stats, your RX/TX, packet drops, everything. It took a bit of a culture change, but it's now very well adopted and people are doing it.

I have a follow-up question. You mentioned the you-build-it-you-own-it model, which is super cool and something we're looking into as well.
And I'm wondering if you ran into any challenges doing that in highly compliance-oriented, regulated industries, where you're always running up against what will or won't be considered compliant with PCI, HIPAA, FedRAMP, and all those certifications, as far as giving developers access to highly privileged actions on your cloud infrastructure.

Yeah, so initially we went through this phase of PCI and non-PCI; we have HIPAA, PCI, SOX, all of that, and the InfoSec team worked very closely with us. There were, I think, 200-plus fixes, somewhere in that neighborhood; I don't have the exact number, but it was around 200-plus fixes we had to do to be compliant with our InfoSec team. And until we did that, the websites were not allowed to take live traffic; we had not enabled those markets to take any live customer traffic. So we had to be compliant with all of the InfoSec and security regulations that were there. Cool, thanks. All right, if there are no more questions, feel free to hit us up on email if anything comes up. Thanks.