Good morning everyone. My name is Travis Newhouse. I'm here with Justin Shepard. I'm Chief Architect at AppFormix. AppFormix provides tools for operators to better manage their cloud infrastructure. We have a data platform that does real-time distributed analysis of resource utilization across your entire infrastructure. And on top of that we build tools so that operators can analyze SLAs, provide orchestration, as well as chargeback and capacity planning, so that you can manage how your physical resources are being consumed by the virtual elements inside your OpenStack environment. We also enable you to provide self-service tools to your users so that they can set their own monitoring policies and alarms. And we integrate all of this on top of both OpenStack and Kubernetes. And my name is Justin Shepard. I'm a distinguished architect at Rackspace, and I function as the CTO for our private cloud business. Rackspace's place in the world in private cloud is really to be your operating partner. So we do have deployment tooling and everything required to be able to lay down OpenStack. We've open-sourced it as part of the OpenStack-Ansible project. But really what we want to be able to do is manage clusters for people. A lot of times companies will get into the private cloud game, and they will spend resources on managing the cluster or managing their clouds. And that takes away from being able to use those resources for the things that make you money. So a lot of our value is in operating those clouds for our customers: hundreds of customers, thousands of nodes, at scale. We're able to deploy OpenStack anywhere, all backed by fanatical support and expertise. So it really is around being an operating partner. As I said, we do have a deployment framework. Seems like everyone has got an installer. Seems like there's also a lot of conversations around how to get OpenStack deployed, how do you get it down as fast as possible. Everyone's got their own widget.
Unfortunately, installing OpenStack is one millionth of a percent of the problem. Apparently math is hard; I think that's one ten-thousandth of a percent of the problem. So really everything that is interesting happens on day two. So day two to the end. That's really where you spend all of your time working with an OpenStack cluster: managing applications on it, upgrading software on it, being able to squash vulnerabilities, being able to roll out new code, bringing new capacity online. And this really does get challenging whenever you're looking at it at scale. So again, we've deployed hundreds of customers on OpenStack, very large clouds, thousands of nodes in a single region, multiple regions with thousands of nodes total. And so the challenge at that level is vastly different than what it is on a couple of nodes. So trying to figure out how to put new OpenStack code on a cloud that's a couple of nodes big is not that challenging. I mean, you can really get through it by hand-rolling all the stuff. It's not the ideal way of doing it, but you can get by that way. Whenever you're trying to figure out how do I patch a vulnerability across a cluster of 1,000 nodes, that really becomes challenging and interesting. This is fleet management. So today I believe is the operator summit as well. There's a large operator talk happening. Those guys are facing this challenge every day. If you're also facing similar challenges, I highly recommend you go get involved with that. There's lots of large companies running lots of large clouds trying to solve these problems. At Rackspace, we run one of the biggest public clouds based on OpenStack, so we do have a history of managing these types of challenges. But I do highly suggest going and checking out some of the fleet management talks, once you start trying to figure out how do I apply a patch across 1,000 hosts, or how do I upgrade a part of OpenStack without taking down all the other parts.
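The 1,000-node patching problem Justin describes is, at its core, a batching and verification loop: patch a batch, check it, and stop before a bad change reaches the whole fleet. Here is a minimal sketch in Python; the host names, batch size, and the `patch_host`/`verify_host` helpers are hypothetical stand-ins, not Rackspace's actual tooling:

```python
# Illustrative sketch of rolling a patch across a large fleet in batches,
# aborting on repeated failures. All names and numbers here are invented
# for illustration -- this shows the shape of the problem, not a product.

def patch_host(host):
    """Placeholder: push new code to one host (e.g. via Ansible/SSH)."""
    return True  # pretend success

def verify_host(host):
    """Placeholder: health-check the host after patching."""
    return True

def rolling_patch(hosts, batch_size=20, max_failures=3):
    failures = []
    for i in range(0, len(hosts), batch_size):
        batch = hosts[i:i + batch_size]
        for host in batch:
            if not (patch_host(host) and verify_host(host)):
                failures.append(host)
        # Stop early rather than pushing a bad patch to 1,000 nodes.
        if len(failures) > max_failures:
            raise RuntimeError(f"aborting rollout, failed hosts: {failures}")
    return failures

fleet = [f"compute{n:04d}" for n in range(1000)]
failed = rolling_patch(fleet)
print(f"patched {len(fleet) - len(failed)} hosts, {len(failed)} failures")
```

The batch size and failure budget are exactly the knobs that matter at scale: small batches limit blast radius, while the abort threshold keeps a systemic problem from marching across the whole cluster.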
The next thing once you get past that is really about how do I take all of the data that's flying around in my cloud and put it to use. I've got all these events that happen: instance boot-ups, volume attaches, network creates and deletes. In addition to that, you also have a bunch of different components that are all reporting and sending data, or having events happen to them. Being able to collect that is a problem. Being able to visualize it and do anything useful with it is a really big problem. One good example of that is around capacity management. So whenever you are operating a cloud at scale, being able to handle new workloads coming online, and really making sure that your cloud is not a bottleneck for the innovation of the rest of your company, is a challenge. One thing that we find is that a lot of clouds are built as general purpose clouds. Almost every single customer, whenever they start down the road of "I want to deploy an OpenStack cloud," has a couple of applications that they're specifically talking about. They may be greenfield; they may be new applications that they're rewriting. And so they might have some idea of the workload that they're bringing on: its performance characteristics, how it behaves. Is it network dense? Is it CPU dense? Is it high memory? Does it require high IO? But a lot of times beyond that, there's a thousand apps that are gonna be brought onto this platform over the next couple of years, and almost no company has a good sense for what that's gonna do to their cloud. And so up front, they try not to spend a lot of time optimizing the cloud for those workloads because it's all unknown. And so you don't know what style of compute hosts you might need, what kind of network capacity on devices you might need, given all these workloads that have yet to materialize on your cloud. So we end up building them as almost all general purpose.
So it fits, it does what it's supposed to do, and it works, but it's not highly optimized for any application stack. What you need to be able to do is collect all the data that's flying across the system, be able to see all the events, see all the usage of your environment, and then be able to start predicting for workloads, classifying them, and bringing online capacity that meets those needs. So you might get down the road and realize that a bunch of the new applications coming on are gonna be big data. And these might need huge-memory instances inside of your virtualized environment, or you might need a lot of IO, right? They may be doing number crunching, so they do job processing: check out a bunch of data, do a lot of computation on it, so you're CPU bound and disk bound, and then file it back. So being able to visualize that data and get a handle on it ahead of it being a problem is a big challenge. The second one is being able to take that data and inform those decisions. So be able to actually watch what's happening on your cloud whenever you're bringing applications on and new workloads are coming on, and be able to react to those in real time and, hopefully in a predictive way, be able to provide capacity before it becomes a problem and before it becomes a bottleneck. The last thing on this is that those workload characteristics do change over time, right? So you might start out with a couple applications that are front-end web applications, or they're systems of engagement, I believe is the right term. And so those might behave in a certain way, you can scale them in a certain way, the capacity constraints behave appropriately with that type of application.
But over time, as you start bringing on databases, those start to look different. As you start bringing on big data environments, those start to have different performance characteristics: you might need to be increasing network throughput to your storage system, you might have a heavy Cinder case that you didn't anticipate when you started. So being able to capture that data and plan accordingly as those workloads come on is very important. Another key issue that we see in operations is really around multiple-system integration. So OpenStack is a collection of a bunch of different services, right? Everyone's familiar with SOA, or the new version, microservices. OpenStack tends to function the same way. Almost no task is accomplished by any single service. Every single one of them crosses a bunch of domains. To spin up an instance, right, you might interact with Keystone to get the auth, and then you make the request to Nova. Nova has to go and schedule reservations. You might also be preloading Cinder volumes whenever you boot directly from volume, so you have to interact with Cinder, and then you also have to provision your networking. So you're crossing four or five boundaries right there. In any scenario where things fail, we find inside of OpenStack that those failures are almost always related to some other service than the one where you generally see the symptom. So as an example, whenever you're booting a VM, let's say that it fails to boot. The symptom is Nova's having a problem. But most of the time, it's not necessarily Nova. It could be. But it also could be that Nova is trying to interact with Neutron to get network reservations, and there is something wonky happening inside of the auth system that's preventing Neutron from properly accepting a request from Nova. And so this is a real challenge: looking at any of these distinct systems and being able to aggregate the data across all those systems and come to a root cause.
We see that in most scenarios, whenever we're running these clusters for customers and we're reacting to issues, it almost always is another dependent system. And being able to aggregate all that data from all the disparate systems into a system where you can quickly visualize across it and react to that data is a big challenge. So being able to pull all of that data into one place and perform a root cause analysis, so that you can quickly get back to bringing your service back online, is really a large part of the operational task. And then OpenStack also has its own infrastructure. So OpenStack is providing infrastructure to your users, but it also relies on some infrastructure. So being able to tie back across to, say, a Galera MariaDB problem or a RabbitMQ cluster problem, these really are cloud hygiene: being able to keep your cluster healthy. I cannot tell you how many times in the middle of a firefight we've found out that Rabbit is really the slowdown. So it's not anything actually failing, but maybe we have a queue that's getting clogged, or we have a bunch of data that hasn't been pulled off the queue and processing time is now taking a while. And so we're starting to see timeouts or race conditions because of that. Being able to visualize all of the distinct OpenStack service data in line with the actual infrastructure behind it, and be able to see all of that in one place and draw a root cause off of that, is the big part of operations. So we have AppFormix on stage with us. We're gonna go through a couple scenarios and show where AppFormix actually helps in this operator capacity, and give a couple hypothetical scenarios. Say a customer's calling in with VM slowness. Now you have to backtrack through the system and figure out what's causing that problem. You've been given a single data point: you have a Nova UUID and it's slow. How do you fix that problem for your customers?
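Working backwards from a single Nova UUID across services is usually a log-correlation exercise: OpenStack services stamp a request ID on their log lines, and grouping by that ID lets you follow one operation across Nova, Neutron, and Keystone. A rough sketch, with invented log lines for illustration (real services emit request IDs in a similar `req-...` form):

```python
# Minimal sketch of cross-service root-cause triage: group log lines from
# every service by a shared request ID. The log format and sample lines
# below are made up for illustration.

import re
from collections import defaultdict

logs = [
    "nova req-42 boot instance 9f3c scheduled",
    "nova req-42 calling neutron for port allocation",
    "neutron req-42 token validation against keystone failed",
    "keystone req-42 401 unauthorized: token expired",
    "nova req-99 boot instance 1a2b success",
]

def by_request(lines):
    """Group log lines by the request ID embedded in each line."""
    grouped = defaultdict(list)
    for line in lines:
        m = re.search(r"(req-\w+)", line)
        if m:
            grouped[m.group(1)].append(line)
    return grouped

# The symptom surfaces in Nova, but the grouped trail points at auth.
trail = by_request(logs)["req-42"]
for line in trail:
    print(line)
```

This is exactly the pattern of the failed-boot example above: the Nova symptom is the last hop of a chain that actually broke in Keystone.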
Let's say that the marketing team is getting ready to launch an event, or the Oprah effect is gonna happen. Being able to go past just project quotas and answer the very simple question of: do I actually have enough capacity in my cluster right this second to be able to handle this incoming spike that I'm anticipating? Another one that we see quite a bit is badly crafted flavors. I know that a lot of times, whenever you're looking at bringing applications from a legacy environment into OpenStack, some of the teams will be used to the way that they've built bare metal machines and will continue to build VMs the same way. And so we've all seen the 256-gig-RAM instance with a terabyte of hard drive space, because that's what the database used to look like that they ran Oracle on. This really messes up a lot of your oversubscription calculations and the way that you've planned out how many units you can put on a compute host. All of a sudden, I've planned to be able to stack 20 VMs on each compute host, and that's how I've provisioned all my capacity, and now I'm able to put one or two on. So I've just taken a tenth of my capacity out. And then lastly, being able to put SLAs and policies in place where you can start to, as you've gone through scenarios, say: this is a leading indicator, this is a performance metric that is a key indicator for this type of problem, and be able to get pre-alarms letting you know that these scenarios are coming up, not once they've happened. So it's a proactive instead of a reactive model. So instead of something being on fire, it's knowing that there's a guy pouring gasoline with a match standing next to the building, and being able to stop him before he lights it on fire. And so Travis is gonna demo a couple of these scenarios for you. Thank you, Justin. So this is the part when I ask you to turn off your wifi as a sacrifice to the demo gods.
Just kidding, but we are gonna do a live demo here of AppFormix running on top of OpenStack and helping you manage it. What you see at the very first screen here is a snapshot of your infrastructure, and it gives you a high-level view of what's going on in both the virtual and the physical layers. You have your hosts, and you get an indicator of whether each host is meeting the SLA that you've defined in a policy. And then you also have your virtual resources, like projects and instances. And right at the top of the stack you can see that certain instances are at risk, and that means they're not meeting the SLA that you have defined. Now, going into the scenario that Justin described, a user might call you up and say, suddenly Mongo seems like it's performing poorly. And they tell you the project name. So we have a search bar where you can quickly find resources inside your OpenStack environment, be it an instance, be it a project, be it a host. You type it in, and you can quickly navigate to the project. And what we're seeing here now is an overview of this project's allocations and resource utilization. At the top you get a snapshot of what the project has allocated: the number of instances that are running, the number of vCPUs, the amount of memory across the infrastructure, as well as all instances and the hosts on which they are running. You can expand and look at the resource utilization of a particular instance and get a snapshot of its CPU, its memory, its disk IO, and network IO. In this case we're looking for the problem, and we clearly see there's something red here. It says that the CPU usage is above 90%. This is violating an SLA policy that the user's put in place. What we want to find out is why, right? And so we can navigate to that instance and look at what's going on inside of it. Our agent collects these resources by running on the physical host. It does not put any kind of agent inside the guest of your tenant.
So we're able to collect these from outside of the VM and provide visibility into the infrastructure at both the VM and the physical layer. We can see here that in this case the CPU is pegged on this instance. So this is likely a problem the user needs to take a look at inside their VM. But what's interesting is that the problem could have been something else. It could have been a noisy neighbor running on their host. And instead of the virtual world, where we were at a project and we saw the instances that were running, we can actually find out on which host this instance is running. It's running here on this host named ACE32. And we can navigate back out to a view where we can see all the things that are happening on that physical host. And from here, we could find things like a noisy neighbor, where a certain VM is maybe using a lot of disk or a lot of memory, and that could be causing slowdown for another project that's co-hosted on top of this same physical node. So that's a quick snapshot of how you can troubleshoot and how AppFormix provides cross-layer visibility of both virtual and physical resources. The next kind of scenario I'm gonna talk about is around capacity planning. And we talked about the idea of maybe bringing on a new application, or you're planning for a large event and you need more capacity than your quota might allow. A user could come to an operator and say, I'd like to increase my quota, I wanna bring up 20 more instances for this new application. The operator needs to know: do I have capacity in my infrastructure to satisfy that request? You can't just give out quota and have it start impacting everyone when the user spins up 20 instances and now all the services across your infrastructure are running more slowly. So bring up this planning tab. The operator at the top can quickly look and see across the infrastructure what the capacity is. And this is broken down by flavor.
So we can see that running on this infrastructure right now there are 36 instances: six medium, 23 small, and seven tiny. And these are arbitrary flavors that are defined inside of OpenStack. Each environment may have a different flavor breakdown and different ways their applications utilize the infrastructure. But the operator now can look at this and say, okay, do I have capacity today to bring up what the application owner wants to do? I can look and see, in this case, I only have one medium available. If I were to give him more quota than that, he would not be able to even create the instance, or I would start getting into an oversubscription situation that is not comfortable for meeting the demands of my applications. And then he can see in real time what the actual oversubscription ratio is across the infrastructure for both memory and compute. In this case, only memory is oversubscribed, but it lets you know that right now you've allocated more memory to VMs than is actually physically available, by a small margin here above 6%. And the second kind of capacity question the operator needs to answer is: when do I need to buy more hardware? When do I need to add more resources to my infrastructure? When is it time to grow my data center? So you can generate reports to see the trend of usage over time. And so here we're watching for basically the available capacity: if it's trending to zero and you're seeing that pretty soon you're not gonna be able to add more instances or new applications, it might be time to buy more hardware and grow out your infrastructure so that you can meet future demands of your users. But buying new hardware is not the only way to solve that problem. And a smart operator is gonna wanna get the maximum ROI on his investment in infrastructure. So before he goes and buys, he might wanna actually find out: am I utilizing what I have right now to the maximum of its capacity?
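The per-flavor capacity check the dashboard is doing can be approximated with simple arithmetic: for each host, take the free resources (scaled by the oversubscription ratio) and see how many instances of a given flavor still fit. A sketch with made-up flavors and numbers, not AppFormix's actual algorithm:

```python
# Rough sketch of the per-flavor capacity question. All flavor definitions,
# host sizes, and ratios below are invented for illustration.

flavors = {
    "tiny":   {"vcpus": 1, "ram_gb": 2},
    "small":  {"vcpus": 2, "ram_gb": 4},
    "medium": {"vcpus": 4, "ram_gb": 8},
}

# Per host: physical resources and what is already allocated to VMs.
hosts = [
    {"vcpus": 16, "ram_gb": 64, "used_vcpus": 14, "used_ram_gb": 60},
    {"vcpus": 16, "ram_gb": 64, "used_vcpus": 12, "used_ram_gb": 58},
]

def slots(flavor, host, cpu_ratio=4.0, ram_ratio=1.0):
    """How many instances of a flavor still fit on one host,
    given CPU/RAM oversubscription ratios."""
    free_vcpus = host["vcpus"] * cpu_ratio - host["used_vcpus"]
    free_ram = host["ram_gb"] * ram_ratio - host["used_ram_gb"]
    return int(min(free_vcpus // flavor["vcpus"],
                   free_ram // flavor["ram_gb"]))

for name, f in flavors.items():
    available = sum(slots(f, h) for h in hosts)
    print(f"{name}: {available} more instances fit")
```

Note that with these numbers memory, not CPU, is the binding constraint: zero mediums fit even though plenty of (oversubscribed) vCPUs remain, which mirrors the memory-oversubscribed situation in the demo.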
And for that, we wanna see that the actual usage matches the allocation, right? Justin talked about flavor sizing and whether you actually have the right-size flavor. And sometimes the users don't know, or they've abandoned certain applications, right? They may have spun up five instances that are now occupying your infrastructure but actually using zero. And so it's preventing you from putting new applications on top of your hardware just because they're effectively squatting. And so if we look at the reporting page that we provide here, I'm quickly gonna show you a graphical snapshot of the report and then go into more detail on the data. So we provide you with, on a per-project basis, how many instances are in that project over the reporting period that you wanna look at, and then some of the detailed utilization. So in this case, here where there's the VM CPU utilization, this is a histogram. And what it shows is that for this project, 25 of the 36 or 37 instances in the project were using zero to 20% CPU over this time period. So they're relatively inactive, and we can see that similarly correlates in disk usage and memory usage. So this is probably a good indicator that maybe this user has the wrong-size flavors, or they have too many instances and they could consolidate some of their applications. So we can drill into the detailed data and we can sort it. So we can look and say, okay, let me look at the instance CPU. I see that some of these instances are very active, they're near capacity, but down at the bottom (and I can sort this in the other direction so it's faster and easier) I can see that some of these instances are really just using zero, they're not using any CPU at all. So it might be time to call up that user and say, how are you using these instances? Do they still need to be around in our infrastructure?
Because as a company, AWS is happy to let you have idle instances on their infrastructure and keep collecting the paycheck. But as an enterprise, you don't want to just be sitting around paying for someone to do nothing on your hardware when someone else could be using it. And that's really part of the responsibility of the operator: to efficiently manage their infrastructure inside the enterprise. And finally, what I've been showing you is a very interactive experience on the AppFormix dashboard to find problems, do reports and capacity planning, and drill into data that is being collected throughout your infrastructure. But operators don't want to sit on a dashboard all day long, right? And they're not able to do that. Nowadays it's about automation, and it's about being able to set policies and alarms so that you're notified when conditions occur, hopefully predictively and ahead of time, so that you can solve the problem before it becomes one. So to that end, we allow you to set a policy for what we call health and risk. So for example, right here, we have a host risk policy, right? So the risk is kind of like a predictive measure of when something is going to become a problem for you and maybe violate the SLA you wanna provide to your users. So inside this profile, you can select a number of rules. I'm just gonna delete it right now and create a new one so you can kind of get a sense. There's some predefined rules here that have already been added. I can select any of those that I want to be part of this profile. In this case, the profile says that when any of these rules match, I want to be notified and mark it at risk. And you can add your own rules, because every environment is different. So every operator has a different threshold for when something is considered hot and when there might be a problem. In this case, you can actually say, I wanna monitor hosts, or I wanna monitor instances.
If I wanna monitor a host, I can limit it to a certain aggregate. So we have a customer, for instance, that partitions their infrastructure into compute nodes for general purpose use and compute nodes for their Hadoop workloads. So they might have those in an aggregate, and you could apply this policy. And if I wanna generate an alert, I can notify across CPU, memory, and disk. We evaluate when disks are going to go bad. We evaluate when a heartbeat has failed, so you can find these problems ahead of time. We evaluate the normalized load, so you can watch the demand on a single host. And this becomes a policy for an alarm. So this will apply to any host inside your infrastructure that is part of the Hadoop host aggregate. And again, this is a great way to visually show you guys in the demo, but in reality, you don't wanna be hand-entering rules. Nowadays, it's all about infrastructure as code. You have to have a reproducible environment. You have to have a manageable environment that you can modify. So what I wanna show you right now is the fact that all of these things can be configured via API. So instead of configuring that new alarm and that policy inside the UI, I just kinda wanna briefly show you what it looks like. This is a JSON file. Hopefully you can read the font there. What I'm gonna set up is that for this aggregate, the Hadoop aggregate (this happens to be the ID inside our system for that aggregate; I looked it up ahead of time), the metric I wanna look at is memory usage. I wanna know when the average memory usage is above 85% over a 15-second window, right? If it's being sustained for that long, maybe I wanna be alerted that on these hosts, maybe I need to take a look and see what's going on. So I'm gonna issue a command here, against the REST API; it's gonna send out an asynchronous request to configure this policy. And then I'll show you inside our dashboard what's going on.
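For illustration, a policy payload like the one being posted might look roughly like the following. The endpoint path and field names here are assumptions for the sketch, not AppFormix's actual API schema; consult the product's API documentation for the real format:

```python
# Hypothetical alarm-policy payload matching the demo's rule: average
# memory usage above 85%, sustained over a 15-second window, scoped to
# a host aggregate. Field names and endpoint are invented for illustration.

import json

policy = {
    "name": "hadoop-memory-risk",
    "scope": "host_aggregate",
    "aggregate_id": "a1b2c3",       # looked up ahead of time (made up here)
    "metric": "memory.usage",
    "aggregation": "average",
    "comparison": "above",
    "threshold_percent": 85,
    "interval_seconds": 15,         # sustained over a 15-second window
    "notify": ["dashboard"],
}

body = json.dumps(policy, indent=2)
print(body)

# Posting it would be a single REST call, e.g. (not executed here):
#   requests.post("https://appformix.example/api/v1/alarms",
#                 data=body, headers={"Content-Type": "application/json"})
```

The point of the infrastructure-as-code argument is that this JSON can live in version control and be replayed against any environment, rather than clicking the same rule together per cluster.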
So let's go to this host here, ACE97. It's part of the host aggregate that's the Hadoop aggregate. If we take a look at what's going on inside its resources here, we will see that the, well, or we won't. That is the demo. So let me just give you a moment there. Did not sacrifice enough chickens. It did not configure, so, unfortunately. Well, I will just show you the rest of the demo as if I'd configured this rule by hand. That's a little bit faster. Unfortunately, I don't know why my API call failed me. So we can walk through the experience you would see on the dashboard again. If I want the average over a 15-second window to be above 85%: we do have some advanced features that allow you to threshold across multiple intervals, as well as integrate with notification engines like PagerDuty or Slack. I'm gonna skip those for now. We do see now that the alarm is in learning state. So for that first 15 seconds, we're gathering data. This alarm has actually been pushed out to that physical node where the agent is running, and all the evaluation for these rules is happening in a distributed fashion at the point of collection. And that allows us to do very fine-grained metric analysis over very small periods of time without pushing data throughout your network, and without requiring you to have a large infrastructure just to support your alarms and policy enforcement. In this case, it's moved to inactive state. If we take a look, you can see that that's because the memory usage on this node is about 67%. But again, that rule was a policy. And so now, if I look at all my hosts, if I wanna add a new host, for instance, maybe this one here, ACE32: if a new host becomes part of that host aggregate, then the policy will automatically be applied. So I don't need to, every time I bring a new host in and expand my infrastructure, go and program hosts individually. So I'm going to take this host and add it to the Hadoop group.
And automatically, inside the Alarms panel here, you can see that the alarm that we configured has been automatically applied to the host. It's gone into learning mode. After 15 seconds, we'll find out whether it's active or inactive, and the system will again continue monitoring this in real time, pushing events all the way up from the physical layer to the central location where we log the events and broadcast them to the dashboard. So in this case, the alert is active. And as an operator, you can go and look and see, okay, I can see that the memory usage down here is 85% for this host. It's actually hovering right around the threshold, so I'd expect that alarm to keep coming and going. So thanks, Travis. So hopefully there were a couple of ways you can see how you can visualize some of this data and actually make use of the data. I will say one thing that you did hit that I was thinking about there: you're showing the actual real utilization of, say, CPU on your cluster. I was talking to a customer pretty recently where they had a corporate policy that oversubscription was a forbidden thing. I mean, they can't actually oversubscribe memory or CPU. And in order to size their cluster appropriately, they had to double or triple the number of compute hosts just to satisfy a non-oversubscribed CPU. And you could take a tool like that and actually have the conversation and say, I know that you don't like this, but based off of the actual workloads that you're running, you would benefit greatly from turning on oversubscription and tuning it to, say, a two-to-one subscription rate. And now you can double your capacity without having to spend an extra dime, if you'll just relax this requirement that you had. And here's the data; we can't really argue with the data. I know that you have a feeling that you don't like it, but here's the data saying it's okay.
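Justin's two-to-one argument is easy to put into numbers. A sketch with illustrative figures (core counts and flavor size are made up):

```python
# With no CPU oversubscription, a host's vCPU capacity equals its physical
# cores; at a 2:1 ratio it doubles. Numbers below are illustrative only.

def vm_capacity(physical_cores, vcpus_per_vm, ratio):
    """VMs per host given a CPU oversubscription ratio."""
    return int(physical_cores * ratio) // vcpus_per_vm

cores, vcpus = 32, 4
no_oversub = vm_capacity(cores, vcpus, 1.0)   # corporate policy: 1:1
two_to_one = vm_capacity(cores, vcpus, 2.0)   # relaxed to 2:1

print(f"1:1 -> {no_oversub} VMs/host, 2:1 -> {two_to_one} VMs/host")
```

Same hardware, double the capacity, which is exactly the data-backed conversation described above; utilization reports are what make the relaxed ratio defensible rather than a gamble.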
So that's kind of an interesting point that I just thought about. So, the Rackspace Cantina. We'll be around all week, second in Trinity. We've got a bunch of our experts out there. They cover all sorts of topics, not just this: networking, monitoring. We have a bunch of book signings. We've got the guy that literally wrote the book on Neutron, we've got operator's handbooks, they'll be doing that. But we have a bunch of experts on staff, and we'd love for you to come out and say hi. There's tequila tasting and/or beer. So come by and say hi to us as well, and thank you all for coming out. Yeah, thank you. I'll throw in that AppFormix is also in the marketplace. We have a booth, so we'd love for you to come and join us as well. Ask those questions, come see it in detail if you'd like to see more of what we do. Do you have any questions for us? Thank you.