Hi, everybody. My name's Travis Newhouse. I'm the chief architect at AppFormix. At AppFormix, we build software that lets operators better manage their private clouds. And we're happy today at the summit to announce that Rackspace will use AppFormix to help manage and scale the performance of OpenStack private clouds for their customers. Thank you.

A little more detail about what AppFormix provides. We have a data platform built around distributed, real-time analysis of metrics. From that, we can generate events and provide operational analytics over your SLAs. We provide state-driven orchestration, so that your workloads can be placed properly across the infrastructure based on actual resource utilization. In addition, we give operators tools for chargeback and capacity planning, and we let operators offer a self-service environment to their customers, so that their own users can define alarms around their applications and their resource usage inside of OpenStack. All of our services and software integrate with both OpenStack and Kubernetes.

Today, what I want to do is highlight a few scenarios and walk through a live demo of what AppFormix does in five situations.

The first is when a user calls an operator and says, "I'm experiencing slowness inside my VM or my application," and the operator needs to help the user find the cause of that problem. How can our tools help troubleshoot?

The next scenario is when a user is planning a new event or starting a new project. They're going to deploy a new application, and they want to bring, say, 20 new instances onto the infrastructure. The operator may need to give them quota for a new project or increase quota for an existing project. But first, the operator needs to know: does my infrastructure actually have the capacity to support this new application? I'm going to talk about how the operator can gain insight into that data.

Next is how you maximize the ROI on your infrastructure, because it's very important that you get the most utility out of what you've paid for and that you operate efficiently inside an enterprise data center. I'm going to show you some reports we can generate on actual resource utilization, so that your users can learn to right-size their VMs to their application workloads.

The fourth thing is how we can improve availability by setting SLA policies and alarms, so that operators know when things are operating out of bounds inside their infrastructure. And the last thing is how, once you've defined those SLAs, we integrate with OpenStack to perform workload placement automatically and keep satisfying those SLAs on behalf of the operator.

So I'm going to jump right into the demo. This is the login screen. We integrate with Keystone, so you use the same Keystone credentials as in your OpenStack environment. Right at the top, you get a snapshot of your entire infrastructure and all the elements defined inside of it. On the virtual side, we see instances on the right, along with the projects users have defined. On the physical side, we see the actual physical compute nodes as well as host aggregates.
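[Editor's note: everything on that screen comes out of the standard OpenStack APIs. As a minimal sketch of how the same inventory can be pulled with the stock clients, assuming admin credentials; the auth URL and credentials below are placeholders:]

```python
# Minimal sketch: enumerate instances, projects, compute nodes, and
# host aggregates with the standard OpenStack clients.
from keystoneauth1.identity import v3
from keystoneauth1 import session
from novaclient import client as nova_client

auth = v3.Password(auth_url='http://controller:5000/v3',   # placeholder
                   username='operator', password='secret',  # placeholders
                   project_name='admin',
                   user_domain_id='default', project_domain_id='default')
nova = nova_client.Client('2.1', session=session.Session(auth=auth))

# Physical side: compute nodes and host aggregates.
for hv in nova.hypervisors.list():
    print(hv.hypervisor_hostname, hv.vcpus, hv.memory_mb)
for agg in nova.aggregates.list():
    print(agg.name, agg.hosts)

# Virtual side: every instance across all projects (admin only).
for server in nova.servers.list(search_opts={'all_tenants': 1}):
    print(server.name, server.status)
```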
All of that is discovered by our OpenStack adapter, which interfaces with the OpenStack services and presents the information in a way that lets you drill down hierarchically across layers, both virtual and physical.

So let's come to the situation I described. A user says there's some kind of slowness in their application, and they need help figuring out what's going on. They might just tell you, "I have MongoDB running, and queries are taking a little longer than normal. The name of the project is DB Mongo." We have a search tool that lets you look things up by project, instance name, or IP address. So knowing the name of the project, I type it in, find it, and get a snapshot of what this project has defined and is using inside my infrastructure.

At the top, pie charts show me the allocations that have been handed out by Nova and by storage. We can see there are six active instances in this project out of a quota of 50, along with how many vCPUs are being used, the amount of memory, and the amount of storage. Those are all allocations, but we also show actual resource utilization in real time. Here are all six instances in the project, running across the infrastructure, and we're reporting real-time CPU usage, memory, network I/O, and disk I/O. We also indicate the flavor and the hypervisor on which each instance is actually running.

Now, you'll notice the red exclamation mark drawing our attention to the fact that something is wrong here. In this case, we've set an SLA that says: I want to know when my CPU usage is above 90%. It's triggered that event and pushed it all the way up to the dashboard. We can clearly see that the CPU is burning up on this instance. We can then drill further down into the instance itself and look at detailed metrics, such as CPU usage, disk I/O read rate, and disk usage, either as a percentage or as an amount of storage. We can zoom in and out over time and look back to see when the problem started.

What's interesting is that the problem may, as in this case, be inside the instance, and the application owner might need to go look at what's going on inside it. It could equally have been the case that none of the instances in the project were at fault, but that other things running on the host were causing the slowdown through resource contention. We can navigate from the project view, as I showed with the search, but at the top, the context tells me that this instance is running on the host named A32. I can click back to that host and see all of the instances running on that physical host, even though they're part of different projects. That can help me identify whether one project is being affected by another. And we can compare the host-side CPU usage of an instance against the CPU usage measured inside that instance, to see how many resources are actually being consumed on the physical host compared to what the virtual machine thinks it has.

So that's a quick overview of how we provide cross-layer visibility into resource consumption inside your OpenStack environment. What I want to go to next is this idea of planning.
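[Editor's note: that noisy-neighbor check maps directly onto the Nova API. As a hedged sketch, reusing the `nova` client from the earlier snippet; the instance name here is hypothetical:]

```python
# Find the "slow" instance, identify its hypervisor, and list every
# instance sharing that host regardless of project (requires admin).
server = nova.servers.find(name='db-mongo-1')  # hypothetical instance name
host = getattr(server, 'OS-EXT-SRV-ATTR:host')

neighbors = nova.servers.list(search_opts={'host': host, 'all_tenants': 1})
for n in neighbors:
    print(n.name, n.tenant_id, n.status)
```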
As an operator, you need to understand how much capacity is available inside your infrastructure so you can decide how much to give your users in terms of quota. We provide a dashboard that shows used capacity broken down by flavor. We discover the list of flavors defined in your OpenStack environment (every environment we've been in is different, and they all define their own custom flavors), and we identify the current usage across the infrastructure as well as the available capacity at the present time. Right now, the infrastructure I'm demonstrating on doesn't even have the capacity to spin up one more large or extra-large instance; all of those resources have already been allocated.

I can also see the actual oversubscription. Inside Nova, you can define your tolerance for oversubscription: you might allow a 2-to-1 CPU oversubscription, or some amount of memory oversubscription. What we show is the actual figure; in this case, 6% of my memory has been over-allocated.

So when the user comes and says, "We want to start this new application. We need 20 instances, and we're planning to go live in three to six months. Can you create a new project with enough quota to kick it off?", right now I wouldn't be able to provide them 20 instances. I can look at a trend over time and clearly see that the available capacity of my infrastructure is nearing zero, so it might be time to buy more hardware to increase capacity and meet the demand of my users.

But buying new hardware is not the only way to solve a capacity problem. We want to be cognizant of cost inside the enterprise, and in some cases we need to maximize the investment we have already made in the infrastructure rather than just continually buying more. AppFormix can help with that by providing reports on actual resource consumption. You can generate reports over different time periods, for resource utilization across projects, or across hosts at the physical layer. What I'm showing here is the resource consumption of all the projects inside this OpenStack environment, and I'm going to focus on the project named AppFormix.

These histograms give a quick snapshot of resource utilization during the reporting period. Looking at VM CPU utilization, 25 of the instances used between 0 and 20% CPU over this period of time; similarly, 21 instances used between 0 and 20% of their memory. I can quickly see a correlation there: at least 20 instances are most likely not using much CPU or memory. Perhaps they're idle. Perhaps a user created them with a plan to start a project, or the application is not really being used anymore. Or maybe someone created an instance that's too big for the workload actually running inside it, and they'd be better off with a smaller instance, freeing up physical resources that I could give to other projects. The goal is to really maximize the utilization of the hardware we have. And we can drill into this information in more detail.
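[Editor's note: stepping back to the capacity math for a moment, per-flavor headroom is essentially each host's free resources, scaled by the Nova overcommit ratios, divided by the flavor's footprint; the oversubscription figure is just allocated divided by physical, minus one. A rough sketch, reusing the `nova` client from before; the ratios are illustrative, and real values come from nova.conf:]

```python
# Rough per-flavor headroom: how many more instances of a flavor fit,
# given each hypervisor's free resources and the overcommit ratios.
CPU_RATIO = 2.0   # illustrative cpu_allocation_ratio from nova.conf
RAM_RATIO = 1.0   # illustrative ram_allocation_ratio

def headroom(nova, flavor_name):
    flavor = nova.flavors.find(name=flavor_name)
    total = 0
    for hv in nova.hypervisors.list():
        free_vcpus = hv.vcpus * CPU_RATIO - hv.vcpus_used
        free_ram_mb = hv.memory_mb * RAM_RATIO - hv.memory_mb_used
        total += int(max(0, min(free_vcpus // flavor.vcpus,
                                free_ram_mb // flavor.ram)))
    return total

print(headroom(nova, 'm1.large'))  # e.g. 0 when capacity is exhausted
```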
Returning to the report, I could sort the instances by CPU over this time period. I can see five instances here that didn't use any CPU, or any memory, over that window. Clearly, these are just sitting on the infrastructure. They're not consuming actual resources, but they are consuming allocations that OpenStack cannot give out to other users under my oversubscription policy.

In addition, we have a policy-based cost model. The operator can define, on a per-flavor basis, the cost of resources, and thereby charge back to their users, perhaps incentivizing each department to manage its own resource utilization effectively. The operator can also compare what it costs to operate inside the private data center versus running the application in a hybrid model or in a public cloud like Amazon, and use that to say: look, compared to Amazon, it's cheaper to run this inside our data center, and we're saving this much money by doing it. (I'll sketch that computation below.)

I mentioned at the beginning the idea of a risk showing up for an instance, and that risk is all based on policy. When resources are shared and many projects are competing for them, the operator is charged with maintaining the availability of the infrastructure. We can't be staring at a dashboard all day, clicking around and watching charts. What we really need is a way to define a policy that expresses the SLA and notifies us when that SLA is not being met. I'd like to show you a little of how we do that; we call it health and risk.

I can define a policy for my host risk as a combination of rules. I might say that if I can't get a heartbeat, if the host is not even alive, then clearly that's a problem. I might also say that if my 15-minute normalized load average is above 70%, that's a problem: there's sustained demand over a 15-minute window above my threshold, with no headroom left for the bursty workloads I might need to handle. I can define several rules over any of the metrics that the AppFormix data platform collects, down to processor-level metrics: we've integrated with Intel Resource Director Technology to give visibility into cache lines and memory bus bandwidth.

You can define these rules across hosts or instances, and you can scope them by aggregates. If it's a host-based rule, I can apply the policy to all the hosts in my organization, or just to the hosts in a particular aggregate. Again, the metrics span CPU, memory, disk space, disk I/O, and network I/O. We even look at things like the SMART counters inside hard disks, hardware counters we can use to predict disk failures before they happen, as well as normalized load.

What I'd like to demonstrate now is how this policy works, because what's key is that it is a policy, and it keeps applying as things change in your infrastructure. As VMs come and go, the policy is automatically applied to the VMs in a project. As hosts are added to or removed from a host aggregate, the policy for that aggregate is applied as well. The dashboard I'm showing you is very nice for a visual representation and for a demo.
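[Editor's note: here is the promised chargeback sketch. The cost model is essentially a rate table keyed by flavor, multiplied out over each project's instances. The hourly rates below are invented for illustration; an operator would define them in the cost-model policy:]

```python
# Hypothetical flavor-based chargeback, reusing the nova client above.
from collections import defaultdict

RATES_PER_HOUR = {'m1.small': 0.02, 'm1.medium': 0.05, 'm1.large': 0.10}

flavor_names = {f.id: f.name for f in nova.flavors.list()}
hourly = defaultdict(float)
for server in nova.servers.list(search_opts={'all_tenants': 1}):
    name = flavor_names.get(server.flavor['id'], 'unknown')
    hourly[server.tenant_id] += RATES_PER_HOUR.get(name, 0.0)

for project, rate in hourly.items():
    print(project, '~$%.2f per 30-day month' % (rate * 24 * 30))
```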
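[Editor's note: to make the health-and-risk rules concrete, here is roughly what the host-risk policy just described could look like as data. The field names and endpoint are invented for illustration; this is not AppFormix's actual schema:]

```python
# Hypothetical host-risk policy combining the two rules above.
import requests

host_risk_policy = {
    "name": "host-risk",
    "scope": {"aggregate": "hadoop"},  # apply to every host in the aggregate
    "rules": [
        {"metric": "heartbeat", "condition": "missing", "for_seconds": 30},
        # normalized load = load average / number of cores
        {"metric": "load_avg_15m_normalized", "condition": "above",
         "threshold": 0.70},
    ],
}

# POST it to the policy API (placeholder URL).
requests.post("https://appformix.example.com/api/v1/policies",
              json=host_risk_policy, timeout=10)
```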
In reality, though, operators really need to think about infrastructure as code, and a lot of people want a manageable, reproducible environment. So of course we support REST-based APIs for all of the configuration I'm showing you on the dashboard. Just to give you a taste of that, this is some JSON that would be input into an API to create a new alarm. I'm going to name this alarm Hadoop Memory 85; what I'm interested in is when average memory utilization is above 85% over a 15-second window, and I'm going to configure this policy to generate that alarm for all the hosts in the aggregate. The policy gets pushed out to our agent, which collects metrics on every host belonging to that aggregate. The evaluation happens at a very fine granularity, and any time the event is seen, the agent pushes it all the way up to the dashboard.

So I'm going to jump back into the dashboard, refresh the page, and go to a particular host. We can see that the alarm did get programmed, and in fact, less than a minute ago it became active, meaning we're above the 85% threshold. We can see the host memory fluctuating around the 80% mark; it must have spiked above 85% to make the alarm active, and the agent detected that in real time.

The policy-based nature means that once the rule is in place, I can configure other hosts to be part of that aggregate. If I bring a new compute node online and add it to the Hadoop host aggregate, the policy is applied right away. It goes into a learning state, meaning the agent watches during that first 15-second window before it can determine whether the alarm should be active or inactive. As soon as the determination is made, the event is pushed from the agent up to our dashboard. It actually enters a message bus that stores it in a database, so we have a history, and in real time delivers it to the UI and any other listeners, including clients of the API itself.

The final thing, with just a minute left, is how we can take that SLA and actually influence the scheduler inside of OpenStack. I'm going to jump over to a different cluster where I've already defined six instances, and I'm going to place some load on one of the hypervisors, such that it will not meet the SLA I've defined. If I look at all the hosts, I can see the risk go red once that CPU load is detected: the CPU usage is above 90%. Now I'm going to launch some instances, and the Nova scheduler will be influenced by AppFormix not to place new instances on the host that isn't meeting the SLA. I'll refresh, log in to Horizon, and create six instances. While those are booting, what we're going to see is that, because AppFormix can influence the scheduler, OpenStack is kept from starting them on a host that isn't meeting the SLA, because we don't want to compound the problem of too much load on that host. These look like they're almost spawned.
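[Editor's note: while those instances boot, a word on mechanism. Nova exposes a scheduler filter interface, and an SLA-aware filter is one way this kind of placement control can be implemented. The sketch below illustrates that interface; it is not AppFormix's actual integration, and the analytics endpoint is a placeholder:]

```python
# Illustrative Nova scheduler filter that rejects hosts currently
# violating their SLA. Filters like this are enabled through the
# scheduler filter settings in nova.conf.
import requests

from nova.scheduler import filters

ANALYTICS_URL = "https://appformix.example.com/api/v1/health"  # placeholder


class SLAHealthFilter(filters.BaseHostFilter):
    """Only pass hosts whose SLA policy is currently satisfied."""

    def host_passes(self, host_state, spec_obj):
        try:
            resp = requests.get(ANALYTICS_URL + "/" + host_state.host,
                                timeout=2)
            return resp.json().get("sla_ok", True)
        except requests.RequestException:
            # Fail open: don't block scheduling if analytics is unreachable.
            return True
```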
If I go back to this view and refresh to see all of the hosts, I can see that this host still has only two instances running on it. This is the one that's out of policy. The other instances have been scheduled across the other two hosts: I added six instances, and three went to each. The CPU load on those hosts spiked while the instances were being created, and it has now gone back to a healthy state. If I stop the load I was running, the risk should go away. If I then start additional instances, OpenStack will schedule them across all of my hosts again, because they're all meeting the SLA and can accept new VMs on their hypervisors. So we're watching those spin up; they're still getting scheduled for a moment here. And now we've seen that OpenStack has scheduled instances across all of these hosts. The load you see is caused by the spawning of the instances and will subside in a moment.

That's the end of my presentation. If you have any questions, I'll take them offline, because I'm out of time. Thank you very much.