So, my name is Ken Pepple. We're here to talk today about deployment testing. This is something separate and distinctly different from a lot of what you hear over in the design summit across the street today, where we do a lot of testing on the development side, specifically around Tempest and other things, to ensure code quality. What I'm really going to be talking about today is something closer to the operations side. Once you actually have a deployment, once you have a cloud out there, how do you test that everything is actually working, and working to spec? And so today I'm going to talk to you a little bit about what the problem space is, a little bit about solutions and things that we've pulled together to use at clients, and then a little bit about the future, about how we could make this better for OpenStack. So with that, my name is Ken Pepple, I'm CTO for Solinea. Solinea is a consulting and development company that helps people go to the cloud, and specifically go to OpenStack clouds. I've been involved in the community since the Bexar release, and I think I'm probably one of the few people out there who has been both an actual code contributor and an operator. I wrote Flavors; if you ever swear at Flavors because it's not very flexible, I wrote that a very long time ago. But I've also run a large production cloud at Internap, one of the first public clouds that were available out there. I ran development and operations for them for about a year and a half. So I've seen both sides of it, the code and the code creation side of it, as well as the actual pain of operating an OpenStack cloud, especially two or three years ago when things were perhaps not as robust as they are today. So what are we going to talk about here? Well first, why is it such a problem that we actually need tools to test our deployments? For a lot of you that have been in IT for a while, a lot of applications, you just roll them out and they just work. Or you roll them out, you run through some quick smoke tests, and everything should be fine. I think what people who've done a number of deployments out there will tell you, though, is there tends to be this: we've got it installed, and we do some things on it, and then oh my gosh, this doesn't work, and we do some more things on it. Oh my gosh, we forgot to do this. And that same process keeps repeating every time we make a change to it, every time we do an upgrade to it. And because of this, you kind of end up, both with your audience or your end users as well as with your operations staff, with this kind of wariness about making changes in your cloud. And there's a wariness around what they think uptime is in that cloud. And what I'm here to talk to you about today is: how can we make this more systematic? How can we make this more repeatable, so that we don't get into this constant "are we done yet?" I'll talk a little bit about types of testing, and then I'll get into actual tools that I've used out there for testing on my customer clouds, as well as things I used to do internally in my public cloud, and show you a little bit about how those work. And then finally, I'll talk a little bit about future improvements, things that I would love to see, and hopefully there are developers in the audience, that we could add into OpenStack to make this easier for us. So let's talk about the problem to start with. Everyone kind of knows there are many components within OpenStack.
And if you sat through the presentation just before this one, there are not only many components within OpenStack, but then there are third-party components that we want to put in our cloud, things like SAP, which aren't actually part of OpenStack. So we have even more connections. Many of the OpenStack components only live to service other OpenStack components, for example, Keystone or Glance. And because of that, there are many interdependencies between them. You may see failures in Nova, and you're really seeing a Glance or a Keystone issue. What's unique on the deployment side, though, is that really no two clouds are absolutely identical. Unless you're just rolling up a single all-in-one cloud, almost everybody's cloud is slightly different. And that's because there's a huge number of user or end-user deployment choices here. I can choose which hypervisors, and I can have several of them. I can choose different authentication means. For example, at Internap, we didn't use Keystone. We already had our own third-party authentication system for all of our users, for colo and things like that. And we wrote modules to actually integrate into that. There are many different storage choices. So for example, just within OpenStack, there is both Cinder and Swift. Many people will opt out and do something like Ceph. So there are a lot of different deployment choices. And then we have a number of, well, quite truthfully, rarely used code paths. There are features out there that truthfully not a lot of people know about. Block storage backup instead of clone: I've never actually seen anyone do that in the wild before. There are many different code paths within Keystone, for example. Not a lot of people out there actually use policies. And so there are these kinds of corners where everything seems like it's working until you try to exercise them, because you have that one problematic user who likes to use things like that and break stuff for you. And they find them. In addition, we have a large number of integration options. You're almost always going to be integrating into other things, whether it's a third party like Ceph, but more likely a third party within your corporation. Most of the enterprise customers out there want to be able to integrate into Active Directory. They want to be able to integrate into Remedy or some trouble-ticketing system. And of course, everybody's Active Directory is slightly different. Two last problems, which are unique to the ops side, though. There are just not a lot of operational tools for OpenStack today. And you've seen this with a lot of other infrastructure in the past: we see a lot of functionality, and then we see a lot of end-user tools, and ops people kind of get left behind. Well, they can write their own tools, or they'll get tools later, that kind of thing. And so today, whereas you can go out for most other applications, VMware, SAP, and find a third-party ecosystem of operator tools to help you with that, that really doesn't exist very much today. And so a lot of operators are kind of thrust into being their own toolmaker. And a lot of them don't necessarily have experience with this new style of computing. And then finally, there's a lot of operator skill which is needed to be able to test this, because almost every end-user task will go across at least five different components. And because of that, you really need people that understand a wide variety of technologies.
And within the operations team, oftentimes you have specialists. I'm the storage guy. I'm the Active Directory person. And no one person can necessarily help you across troubleshooting or fixing any particular problem, much less write a tool for you to do that. And so we've got a number of problems today which are kind of summed up in one picture, which is there are a lot of connections and there are a lot of programs. So how can we look at solving this? Well, I think part of it is we need to step back and look at what we really need to be testing here. And I think we really need to be looking at three different things. First and foremost, we need to be able to test that functionality exists. We need to make sure that if we promise the customers out there that they can start up VMs, back them up through a snapshot, and relaunch them, we always have that. And so functionality is usually at the top. But it's much different from what the developers are looking at. I'm not looking for a unit test to make sure a particular method works. I'm not even looking within one particular OpenStack application to see that that actually works. I need to have a user scenario. So the scenario needs to start from the point where the user signs in, all the way through to the desired end state that he would like. And almost always, that's going to include Keystone, as well as several other things that are in there. And unless that entire chain works, that functionality is not delivered to the client. So it is different than what we see on the developer side for testing. Secondly, I'm not actually looking for bugs in the code. I'm not trying to test code quality. I am trying to test the deployment quality. And so if there actually is a bug in the code, operations probably isn't the group to fix that. While it's great that we could identify it, we're not there to fix that. We're there to fix deployment problems. And so while we will certainly look for code problems, and as you'll see when we put together kind of an architecture here to do this testing, we'll look for those, and we'll make sure that we're upstreaming those and telling other people about them, we're not specifically testing the code. We're testing the implementation of the deployment here. I would put one caveat on that, which is unless you actually are a vendor or you're running your own code. So for example, at Internap, we didn't use Red Hat or IBM or anybody like that. We used everything from trunk. And we owned our own code base, because, like I said, we wrote our integrations into our own authentication systems and such. So on those, we actually did test code as part of our operations and deployment testing. I would posit to most of you, though, that if you're in the enterprise, you've probably bought a solution or a distribution from someone, HP perhaps, and you really should be relying on your vendor to fix those coding problems. And you spend your time on the deployment side. Now beyond just the functionality, though, we need to look at non-functional requirements. How is that actually operating for us? So yes, it starts up a VM. But if it takes 20 minutes or an hour to start up a VM, many of your users will say that's broken. And so we need to look at those non-functional requirements. And they generally fall into three areas, although obviously there are many more: scalability, performance, and security. So I always want some kind of tests around those areas.
In your environment, depending on the scale or depending on the applications that you're running on there, some of those may be less important or more important. You may have other non-functional requirements out there, like compliance, for example. But you need to at least be looking at these kinds of tests here. In addition, these tests really aren't covered by the developers. They're not going to have a security test in there to make sure there's a compliance statement, or that it's actually going through policy control, and your policy control, or that it's kicking out a Remedy ticket for you. Those are things that we're going to have to do ourselves. And then finally, we need operational tests. Things about how we operate it and how we've deployed it: can we actually go do that? If we've said this is a high availability control plane and our controllers are going to be HA, how are we testing to make sure they actually are HA before we find out at 3 o'clock in the morning? And so while there are other kinds of testing here, I would say we definitely need to be testing these different areas. Now, here's the difference, I think, between what we've done in the past in operations for testing and what we need to be doing here in the future in the cloud. In the past, you probably had a test person within ops. He had an elaborate test plan written in Microsoft Word. It had many, many tables in it. It was probably 60 or 70 pages. And whenever you wanted to do a test, and usually the only time you ever did this test was the first time you installed it, you gave someone this huge Word document, probably the least liked person on the operations team, and you said, walk through this and write down everything that comes out of the computer as you do these tests, and do them manually, and do them very specifically in the order I've told you to. There's a problem with that, which is, A, nobody wants to do it, and when nobody wants to do it, it's not done enough and it's not done properly. In the cloud, because of the multiple components and because of the interactions, we really need to be able to be doing this all of the time. And so manual testing, for the most part, really is not going to do this for us. Now, this is something where the coding or development side of the house has been well ahead of us. They did that same thing back 10 or 15 years ago, where you had a dedicated QA department and those people ran through tests all the time. And they came to the realization that this wasn't going to work somewhere between seven and 10 years ago. And this is all automated now. And if you look into the OpenStack continuous integration and continuous deployment process, you will see we actually have a fairly state-of-the-art testing process there. And we need to borrow some of that and infuse it into operations and into our deployment testing. And so we need automated tests that are repeatable, that can do multiple tests and multiple checks on each scenario. It needs to be continual testing. We need to be doing that every time we change something, which would be the first time we do it, but also every install, every time we add new compute nodes that are out there, every time we change any of the configuration options. And in fact, I would say there's a subset of tests that you should be continually running, which I'll talk about later. That also, though, needs to be part of general operations procedures. It is not a separate QA function.
If you look over at the development side anymore, they may have a few QA people left, but there's not a hard line between developers and QA people. When I ran development for Internap on our cloud, we didn't have QA. We had one guy from the central QA department who told us what kind of testing we should do from a procedural point of view. We wrote all the tests, though. And I would say in operations, we also need to do that, which is you need to have everyone be able to run through tests at any time. They should have both the knowledge as well as the tools to be able to do that on an ad hoc basis, or automate it as part of an installation or an upgrade. And to do that, we need to make it easy. We also need to be able to integrate this into our operational procedures. So it shouldn't be a special event that we're doing testing. It should be just a normal operational thing. So for example, at Internap, we had specific tools that would run tests for us on the significant scenarios. We actually integrated that into our standard operating procedure for tier one service support. So we had outsourced tier one support, the call center support. And within their scripts, if a call came in and said, hey, I cannot start up a VM, the first thing they would do is not go through and ask them a bunch of questions. The only thing they would ask is, are you this person and are you using this login? And immediately from there, the support rep could run a command that would automatically spin up a machine or a VM for the caller, as the caller, and it would tell him within about a minute and a half whether or not that was possible. And all he had to do was type in one command to do that. And so very easily there, we had integrated that test into the operating procedures. And right there, we were able to cut call time and the number of calls in half. And so as you make that easier, it's easier to integrate into your operational procedures, even in the troubleshooting phase at your earliest support tiers, and you can greatly increase service. So let's then talk about what we actually need to do for testing. For most of you who have been involved in testing before, you know there are different levels of testing out there. So there's unit testing. Unit testing is the domain of the developer. So if you've ever done commits for OpenStack before, you've gone through the process. There's a tool called Gerrit that's out there. In Gerrit, you put up a code review and then a large number of people come to criticize you about both your coding style as well as your general intelligence, but it also goes through and actually runs a number of tests on your code for you. Unit tests need to be part of any new feature. So if there aren't any new feature tests, or if there aren't any new unit tests to test the new methods that you've written, it'll almost always get kicked back to you. But those are really doing things at such a low level. They're not extremely useful for us in operations. They really should be left to developers. In addition, there are also integration tests. Integration tests string together several methods to be able to do something. This is closer to what we want, but really not exactly what we want. It's not really an end-user integration. End users really want a scenario, something that actually does something and provides something back for them. So we really want to hone in on what most people call user acceptance tests, or a plain-English test of: I was able to log in and start up a VM of a certain size.
I was able to attach a volume of a certain size and create a file system on it. Or it may be even longer, and it might be something like: I was able to create a Hadoop cluster. And so we need to really focus in on those kinds of things, which will often be very specific to your enterprise and even the applications on top of it that you use. Now, some of the tools that we have won't do all of those things for you. You'll probably have to add some things onto that. But we do have some higher-level tools like that which will provide tests for us on this. And then the other part here is unique to operations, which is operational tests. Make sure that we can fail things over. Make sure that if we have a security breach, it is somehow caught. Make sure that we're doing compliance and things like that. And so the two things in bold are really the things that we need to look at, that we need to make sure we're testing for. In addition, there are all these functional areas. I won't go through all of these here, but the important part is the things outside of the applications. Your cloud will not just be OpenStack. And in fact, when I ran development, I had somewhere around 16 or 17 people. There were four or five people who never did any OpenStack coding, even though they were OpenStack developers. They wrote integrations into our billing system. They wrote integrations into authentication. They created images for me, all of these kinds of things. And so make sure that you're paying attention to these things outside of OpenStack, which are quite needed but not necessarily going to be covered by any of the OpenStack tools. And so let's look at three different ways that we can do testing. And now we're getting into specific things about OpenStack here. First and foremost, what most people know about in OpenStack testing is called Tempest. Tempest is run on every single commit to every single OpenStack project, and it runs a huge suite of tests for them. It is mainly in the developer realm, though, but it does have some overlap with the testing that we'd like to be able to do. There's a newer project within OpenStack called Rally. Rally is much closer to what we're looking for, and I'll spend a lot of time on that today. Rally kind of up-levels Tempest: it's able to run Tempest-type tests, but also to string together Tempest tests and common user scenarios so that we can then benchmark them as well as verify their correctness. And finally, there are going to be some manual tests that you're going to have to write. There are going to be some things within your organization or your enterprise or your service provider which are just specific to you, and you're going to need to be able to write those kinds of tests yourself. So let's look at Tempest. Tempest is an old OpenStack project. It's been around since almost the start. It runs hundreds of different tests. So if you've ever run it before, you kind of kick off Tempest, you go get a coffee, maybe take a little nap, get another cup of coffee, come back, and Tempest is probably still running, or it's telling you, I had an error for you. But it runs a number of things, like every unit test for every program. It goes through and does some scenarios, probably not the scenarios we need, but it does some of those scenarios. It does stress testing. So it repeatedly beats on, for example, a particular code path, like the authentication or token creation code path out of Keystone.
It determines API correctness, to make sure it's actually coming back with the correct API commands and responses, as well as in the right format. And it tests all of the CLIs, which are somewhat difficult to test within the programs sometimes. The problem here is it's actually quite complex to configure. There is actually a larger configuration file for Tempest than you have for Nova. It is not portable to every cloud deployment configuration. So for example, because there are different deployment options within Nova, different hypervisors, all of those tests reside in Tempest. But for example, at Internap we were a Xen shop. So we had Xen hypervisors and some Hyper-V hypervisors. The majority of the tests were written for KVM. We couldn't use any of the KVM tests. We couldn't use any of the VMware tests. We had to write some additional Xen tests and things like that. And so it's not exactly portable for every cloud deployment. And you need to understand what the tests actually do in order to exclude them from your test run. Otherwise you get a ton of errors. So even if you've configured Tempest completely correctly, you still need to have the knowledge and be close enough to development to understand what these tests do and whether or not they're actually relevant. Some tests will fail on your cloud, and they probably should fail, because you don't have those specific things in there. And because of that, it really requires a significant investment in time and skills to be able to run this. And it's a continual investment. Just because you were up to speed on Tempest in Icehouse, chances are you're going to have to learn a lot of new stuff in Juno, because we're constantly adding tests to it. And so because of that, we really needed a tool which was slightly higher level than this. And this is where Rally came in. Rally is a framework for validating performance and benchmarking, as well as running some verification tests. It uses Tempest at times, but it will also do other scenarios for us. It's a separate program that you run and point at your OpenStack cloud. And then it actually goes in and records not only the correctness of the tests and what tests it ran, but also the results and benchmarks of those tests, so that I not only have the tests that I've run, but I also have a historical view of the tests I've run and how fast they ran. And so the most important thing within Rally, I think, is what are called scenarios. This is a scenario run, a specific one: I'm creating and listing users. So there's some code path in there which says go out and create a number of users and then go ahead and list them. And what you'll see here is that within every run or every scenario that I have, I will have different actions. It will give me the statistical latency on those actions, the success rate, and the concurrency amount. So unlike Tempest, which runs most tests once, Rally is really made to run them concurrently, many times, with many different users. As you'll know if you've been in operations before, just because it works well on your laptop does not mean it will work well across five different data centers and 1,000 machines with 10,000 people on it. This is a way to get a more real-world look at that: not only correctness, but latency, performance, and scalability. As I said, Rally keeps a database of all the tests that you've run.
And so I can look at each test, it gets a UUID like most other things within OpenStack, it gives me a status, and then whether it failed or not. And then I can also tag them with anything that I want. I can then go back through them. So for example, I can run tests today and they're failing. I can go back to one from yesterday and it's passing. And I can see the differences between where it's actually passing and failing. I can also go back and say, things seemed to be running really fast last week and we're running slow this week. Or specifically, when the end user comes in and says things are slow, you can actually have a very objective way of saying: this is how long it took to create a VM last week, this is how long it took this week, they appear to be about the same, so this is probably user error. So it allows me to actually do some trending and benchmarking. It also is customizable. So within each one of these scenarios, you'll see that there's actually a script. And at the top, it's actually calling a method. This particular one is create and delete user. I can customize it with some more arguments, and those arguments are specific to the scenario. But for this one, they're things like names of about 10 characters, and so on. And then I can change the concurrency of it. In this one, I'm going to do it 100 times and I'm going to do 10 at a time. And so I can crank that concurrency up or down, depending on what my user community looks like. But I also can track down intermittent errors. And this is especially important in your Keystone environments, where you're actually dependent upon a third-party authorization source. So for example, at the customer where I was using this, we had a flaky AD connection for some reason that we were never able to pin down. And I could run this with concurrency, crank it up to 1,000 times, run it at a concurrency of like 20, and I could actually see those errors and bring them back to the Active Directory team. Whereas if I ran it myself, every time I'd run it in front of them, of course it worked perfectly. And so this gives you some help, especially with your integration points and especially with load testing, because those intermittent problems can be very difficult to track down. Rally scenarios are written across a large number of the different components which are out there. As you can see, everything from authentication to some of the new things like Sahara, Zaqar, and different things like that. In addition, you can see there's Tempest in there. I can actually use it as a test runner for Tempest, and I can skip over some of the more complex configuration options for it, as well as just run a small subset of those tests. So for example, I can just pick the tests that I want by name for a scenario, or I can do a regular expression, a regex, and say, for example, run only Glance tests if I'd like. And I can set those up as different scenarios then. And then finally, it provides a visualization for you. So at the end of this, I can go ahead and plot any one of my scenario runs and I get a number of visualizations for it. This particular one is showing me the table of test results. It gives me a chart. It will also have a few more of these, which I'll show you here in a second, but it'll actually break down the different steps so you can figure out where you may be having a performance or a scalability issue within this.
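For reference, a Rally task definition along the lines of the scenario just described might look roughly like the following. This is only a sketch: Rally actually consumes the equivalent JSON or YAML task file, and the scenario name, argument names, and runner fields shown here are assumptions based on Rally's general task format rather than anything taken from the talk.

    # A rough sketch of a Rally task definition for a "create and list users"
    # scenario, written here as a Python dict; Rally itself reads the equivalent
    # JSON/YAML file. All names below are assumptions about the task format.
    task = {
        "KeystoneBasic.create_and_list_users": [
            {
                "args": {
                    "name_length": 10       # user names of about 10 characters
                },
                "runner": {
                    "type": "constant",
                    "times": 100,           # run the scenario 100 times in total
                    "concurrency": 10       # with 10 runs going at any one time
                }
            }
        ]
    }

Turning the concurrency up or down is the knob being described here: it changes how hard the scenario leans on the cloud and how closely the run resembles real multi-user load.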
So with that, let me skip over and actually show you how this works. So, is this up? Okay, you can see most of that. What I'm going to do here is, on my laptop, I have two VMs running. I have a VM that runs Rally, and then I have an all-in-one server, an RDO server, that's running all the OpenStack services. I'm on the Rally server right now, and what I'm going to do is tell it how to contact my cloud first. So I'm going to take my keystonerc file and source it so that I have all of the administrator passwords. I need the administrator passwords because I'm going to be doing some things which require the administrator APIs. And once I have that, I can go ahead and create what's called a deployment. So with a Rally deployment, it's going to create a separate cloud definition for me. One of the nice things about Rally is that it's actually meant to be able to test different clouds. And so within your environment, you'll be able to have a centralized Rally server or test server and have it point at different clouds within your environment, for example your test, your stage, and your production clouds, all from one place. And so you'll actually be able to run the same test against each one of them and see what the performance differences are. So for example, I can run something in production as a baseline, then go to my test cloud, make some configuration changes, and see if that's faster or slower for me. So I've created a deployment here. Once I have my deployment, I have a cloud. So with that cloud, I'll make sure that Rally actually can contact it. And it'll come back and tell me what's in my cloud, and it makes sure that I actually can contact my cloud. From there, I can do a number of things. I can, for example, do a verify, where I'm going to run a full set of Tempest tests against it. I can set up a benchmark, but usually what I do is use scenarios, where I've picked out a number of scenarios that I like, they're in my test suite, and I'll go ahead and run them. So let's run one real fast. I've picked one file out here, which is my create-user-in-Keystone scenario. I will tag it and I'll run it. From here on out, it's going to execute this according to the scenario file that I've created, and I'll show you that here in just a second. But basically it's spawning off a number of threads, a number of workers, to run the concurrency and run the tests that I'm looking for. I'm doing this right now in non-verbose mode. There is a verbose mode, which will give you pages and pages of all the debug commands that it's using. But right now it's just going off and doing those particular things. When it comes back, it takes all of the information and all the output and stuffs it into a SQLite database. And with that, I'll be able to go in and actually query past runs, whether they ran or not, whether they failed or completed, and I'll be able to pull back all of that information. In addition, I'll also be able to go in and plot all of that. So you can see here, actions. This one has a single action, Keystone create user. How long each one of them took in seconds: minimum, maximum, average, 90th percentile, 95th percentile. How many of them succeeded and how many I actually ran. So very quickly, it's given me: okay, it takes about three seconds to create a new user in Keystone today, and almost always it's less than five seconds to actually do that. Good to know; a little slow, but good to know. Once I have that, I can then go ahead and plot it.
So with this, I've plotted it, and it pulls up a nice JSON-driven webpage here, where I can look into each one of these runs and find out how long things were idle, how long things actually ran, and the successes. I can come into the details and break out how long each one of the particular actions took. In this particular one, you can see it just creates and lists, and you can see that creating the user took most of the time, not listing the user. In other ones which are more complicated, for example scenarios where they start up and reboot a VM and things like that, it'll actually break all of that down into the specific actions. You can see where there are problems. You can also then see where most of the time is spent, 88% of the time here is in create user, and then you can see it across a histogram here of duration, how long it's actually taking. And then finally, it actually shows you the configs. And so just like before, I'm using a concurrency of 10 and I'm doing it 100 times. All of this kicks out to a webpage, and you can have it automatically do that for you so that you can go in at any time and look through all of these statistics. So that's Rally for you. There are a large number of scenarios in there. In addition, you can write your own. So within each one of those areas that I talked about earlier, there were, I don't know, 20 areas that you had scenarios to test. You can also go in and write your own. There are a large number of methods that are actually built into Rally or into Tempest, and you can actually create your own scenarios with them. This does require some Python knowledge, but you can do it. The code is fairly simple. If you've done some OpenStack or you've done some Python before, it's actually fairly straightforward. Truthfully, the only thing which is somewhat complex is the threading part of it, because of the concurrency. But for example, we've had to customize this before. At one of my clients, as you can see, all of these always said create a user and then list them. Well, my user had a static or read-only Active Directory, so you couldn't create users. So we had to make a version of this which used static users. We were able to make that modification within a day. So the code is fairly simple, and it is something that you can actually change and script yourself.
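As a rough illustration of that kind of customization, a custom scenario plugin might be sketched along the following lines. The Rally module path, base class, and decorator here are assumptions about the plugin interface, which has changed between Rally releases, and the client calls are illustrative only; this is not the code that was actually written.

    # A very rough sketch of a custom Rally scenario for a read-only identity
    # backend (for example Active Directory), where users cannot be created and
    # we only want to exercise the Keystone listing path. All Rally-specific
    # names (imports, base class, decorator, client helpers) are assumptions.
    from rally.benchmark.scenarios import base


    class StaticKeystoneUsers(base.Scenario):
        """Scenarios that only read existing users instead of creating them."""

        @base.scenario()
        def list_static_users(self):
            # Grab a pre-authenticated keystone client from the credentials
            # Rally was given for this deployment, then list the existing users.
            # This drives the Keystone-to-directory path without any writes.
            keystone = self.admin_clients("keystone")
            keystone.users.list()

The point is less the exact API than the shape: a small class, one decorated method per scenario, and the client calls you want timed placed inside it.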
So beyond those functional tests, where I'm hoping that you'll be able to find preset functional tests already done in there for you, or be able to string together a few and get a test document together, you do need some operational HA testing, and this is probably the one place where I do say you probably need to do manual testing. In general, it's very difficult to correctly do an automated HA test, to have something fail over and fail back, and quite truthfully, you tend to do it manually in production anyway, at least the fail-back part of it. And so we need some kind of operational test here to be able to deal with hung processes or killed processes. We need something that covers what happens if I lose network connectivity, or if I actually lose an entire machine, or if I lose an entire rack. A lot of those are going to be very custom to you, and may really, truthfully, be better done as a manual process. But those operational tests do need to be out there. I think almost everybody will need to have these high availability tests, but you may need to have others, and they may need to be manual. The one thing that I will say, though, is that just because they're manual tests does not mean they're not continual. So continual testing, I think, is important, and almost an absolute must if you're running a 24-hour, high-availability cloud. It should be part of your general monitoring and management status. In fact, at some of my clients, we actually took the output of Rally, ran it in a cron job, and then fed it back into their monitoring system, whether it was up or not, as well as the mean time for it, so they could actually track those scenario times. In addition, at some of my other clients, we like to have something called the order canary. In general, we reject some of the Netflix things out there, where I should just have a Chaos Monkey that goes out and randomly kills some of my applications to keep my developers on their toes. But we do like having the order canary. And the order canary is: every five minutes, every 10 minutes, I'm going to do the most common user action, which is I'm going to go out and start a VM, attach some storage to it, ping out and do a yum update, and then kill it. And so constantly, I always know that I have at least some level of the ability to sleep at night, knowing that someone will tell me, hey, I can't start up a VM, before someone like my CEO calls in or something like that. And so we tend to have this order canary that does all my happy-path testing, and sometimes we can use Rally for that, but oftentimes we've done it as a command-line tool, because we could also then give that to our operational staff, like the people answering the phones. And so they could just type canary with the username that they got from the user who called in, and immediately they would find out: yes, I can start up a VM as you; obviously, you seem to have a problem I don't. I would say that this is something you'd probably want to build. You'd also probably want to have some set of business-critical tests hooked up into your monitoring. So while I wouldn't say let's run all the tests all the time, I do usually pick a subset of those tests and make sure that they're always running, and just put them in cron, feed them into your monitoring system in some way, and then I'm making sure that that's always running. The big thing that you always want is to make sure this is operationalized. Testing is not an event, it's just normal operating procedure for us. So to put this together, we usually have an admin box somewhere in our cloud, and that admin box can usually talk to each one of our clouds: our test cloud, our lab cloud, our actual production cloud. So we usually run Rally on that. Rally takes up very, very few resources. We usually run it in a minimal VM with one CPU or one vCPU. We can also bring it down whenever we want. I usually keep it on my laptop so that any time I go to a client site and I'm actually working on their cloud, I can run ad hoc tests very quickly just by creating a new deployment. And then I make sure, on the other side of it, I've got monitoring. So one of the key things is that we need to make sure that, in addition to running tests, we're actually monitoring those tests, so that when we do find something that fails, we have some way to reproduce it or go in and troubleshoot it. And so we usually will hook that up with something on the monitoring side to be able to do that kind of root cause analysis.
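To make that a little more concrete, a stripped-down canary along those lines might look roughly like the following. This is only a sketch, assuming python-novaclient and the usual OS_* environment variables; the image and flavor names, the timeout, and the Nagios-style exit codes are placeholders, and the attach-a-volume and yum-update steps are left out for brevity. It is not the actual tool being described.

    #!/usr/bin/env python
    # A minimal sketch of a "canary" happy-path check: boot a small VM, wait for
    # it to go ACTIVE, clean it up, and exit with a Nagios-style status code so a
    # cron job or monitoring check can consume the result. Credentials, image and
    # flavor names, and the timeout are placeholders, not values from the talk.
    import os
    import sys
    import time

    from novaclient import client as nova_client


    def main():
        # Credentials from the usual OS_* environment variables, for example from
        # a sourced keystonerc file (username, password, tenant, auth URL).
        nova = nova_client.Client(
            "2",
            os.environ["OS_USERNAME"],
            os.environ["OS_PASSWORD"],
            os.environ["OS_TENANT_NAME"],
            os.environ["OS_AUTH_URL"],
        )

        image = nova.images.find(name="cirros")     # placeholder image name
        flavor = nova.flavors.find(name="m1.tiny")  # placeholder flavor name

        start = time.time()
        server = nova.servers.create("canary", image, flavor)
        try:
            deadline = time.time() + 300            # give up after five minutes
            while time.time() < deadline:
                server = nova.servers.get(server.id)
                if server.status == "ACTIVE":
                    print("OK: canary VM active in %.1fs" % (time.time() - start))
                    return 0
                if server.status == "ERROR":
                    break
                time.sleep(5)
            print("CRITICAL: canary VM not ACTIVE (status=%s)" % server.status)
            return 2
        finally:
            server.delete()                         # always clean up the canary


    if __name__ == "__main__":
        sys.exit(main())

The exit code is what lets the same kind of script double as a one-line support-desk command and as a check that cron or a monitoring system can run on a schedule.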
So I guess, kind of in closing here, what are the best practices that I think you should follow? First of all, always be testing. Testing is part of operations; it's not an event, it's not a person. As much as possible, you need to be transitioning manual testing into operational procedures. Whether that's tools, whether that's a procedure that's built into the standard way you do things, it needs to be part of it. So one of those things, and I know a lot of people don't like this, is fail over regularly. If you're using HA and you say, I've never seen a failover before, whenever I hear anyone on my staff say that, it's like, we're failing over right now and you're going to fail it over for me. Failover should not be an event, it should be just how we do things. And quite truthfully, we should be able to fail over at almost any time. Testing also needs to be part of all of those processes for you. So even if you're doing continuous integration and continuous testing, or you do it on a quarterly basis, that testing should always be part of it, and should always, hopefully, be automated for you so that it's not a big thing. As part of that, you do need to make sure that somehow this is actually put into your monitoring scheme. Now there's a variety of ways to do that. If you're using Nagios or something like that, you can write it in as part of one of their external commands. If you're using Zabbix, you can do a similar thing to put it into a template. We wrote our own monitoring system called Goldstone, which we make available free to people. We like to be able to see the logs, so that whenever we see a traceback that comes out of it that indicates an error, we can see it. And we make sure that we log all of these things. In addition, we actually have it look at the Rally logs, and so we can find Rally tests that didn't go well too. Because remember, testing is just part of procedure here. So for the final part here, if there are any developers in the room or people that want to become a developer, here are things that we need. Today, it's very difficult to describe what your cloud looks like. So yes, I know that I have a cloud and I know it's API level one for Nova. We need to be able to expose more information. We need to be able to expose, with security, for example, what hypervisor capabilities we have. Do we have live migration? What kind of host aggregates are out there, perhaps? We need to have more things which describe the capabilities of our clouds. Swift has taken the lead on this. So if you do a Swift slash info, it actually tells you what capabilities it supports, whether it supports container sync and those kinds of things, whether it supports the WWW export and things. And I'd love to see that in all of our services, so that as an admin command, I can go in and programmatically figure that out. That would lead to us being able to tell Rally, here's how you should test my cloud. I would love to see Tempest have a simpler configuration, perhaps a DSL, so that almost anybody out there could use it, or so that it was very easy to run clusters of tests. And then finally, I'd love to see more of the monitoring projects out there include testing within them. So with that, I think I'm running over, and I'd like to thank everyone for their attention, and I hope you have a great rest of the show.