Hi. Welcome, everybody, to the OpenStack Tool Shop session for the day. What we have done is summarize a whole bunch of OpenStack tools, and I'm expecting that by the end of the session everybody can walk away with some sort of navigation map for their test procedures and some sort of cheat sheet. Quick introduction: I'm Ruchika, a Cloud Success Architect at Red Hat. I have a very varied background. I've worked in embedded systems and microprocessors, on the Linux kernel and the Android subsystem, and then I joined a startup where I got introduced to Ceph and OpenStack. Now I work at Red Hat, where I help customers adopt OpenStack and help develop best practices. This is Kevin.

Hey, I'm Kevin Jones, a cloud solution architect for Red Hat. My current mission is to put OpenStack into public service, so I serve the US federal government, state and local governments, and the education space as well, so universities. We're seeing a lot of uptake with OpenStack in the science and research community, especially around the teaching hospitals and the national laboratories, and I actually came from that space. I was formerly at NASA Langley Research Center, where I was the chief technologist for IT on the IT contract there, responsible for cloud consumption coming into the center, which was Amazon Web Services and OpenStack as well. That ultimately led me to Red Hat due to contract changeover.

When I look out at our customer space right now, we see a tremendous focus on what OpenStack is, what comes out of the box, and then how you deploy it consistently. Over the last year and a half or so there has been so much focus on deploying it consistently. What we're starting to see is that now that customers are able to deploy it, they need to maintain that new, shiny OpenStack over time, and they're starting to think about how to do that in an effective manner. But typically they're too overwhelmed to know what tools are available to them to do proper validation on their environments, functional testing, benchmarking, et cetera. A lot of our customers don't even know these tools exist. So once you've made that steep climb to deploy OpenStack, your manager immediately comes back and starts asking how solid you feel about your testing. How well is it going to perform? In the configuration we have now, is it the best that it can be? Ultimately, are users going to be able to adopt this technology and bring their workloads, and can we help them do that? So Ruchika, with these questions in mind, what should we see as our overarching goal for our OpenStack deployments?

In my mind, the whole goal is to build as much confidence as possible in your cloud solution, and I believe it's possible to do that with the tools we have. The way I like to go about it is to break testing into multiple phases: run validations to build a strong foundation, do functional testing, then proceed to performance testing, then do some amount of tuning, and finish off with insights. We'll go through all the tools available for each of these.
So like Ruchika said, if we can utilize this toolset at the right times, in the right places, and in the right ways, we can take advantage of it to build high confidence in our cloud deployments, and that's what we want to do for our customers.

We do have to stop and cover a little bit of the assumptions and terminology we're going to use today. The first one is TripleO. At Red Hat, we use TripleO to deploy OpenStack; it's OpenStack on OpenStack. Sometimes that's a confusing concept for our customers, but we believe it's a great way to do lifecycle management. What TripleO deploys is called our overcloud, and that's where all of our user resources actually live, things like instances, volumes, et cetera. Then we have the undercloud, which is an all-in-one OpenStack deployment that uses Ironic to bare-metal provision servers and Heat and Puppet to deploy and configure them. Lastly, you might hear us talk about OSP Director today; know that that's synonymous with TripleO in the upstream and RDO Manager in the midstream, and we use this to do lifecycle management on OpenStack. I don't want you to leave the room because I mentioned TripleO: most of the tools we're going to talk about today can still be used on non-TripleO-based deployments. What we want you to walk out of here with is how to use these tools, when to use them, and some examples of how they can be applied.

So the first thing we're going to talk about is validation of the environment. When we go into customer spaces, we always run into a lot of problems. Networking is a huge one, right? We can give a checklist up front, we can do pre-calls all we want, and we get in there and it hasn't been configured, or it's been changed the night before, and we still have to go through tracing cables from switch to server. Misconfigurations are another thing. And with our undercloud, because it's deploying large-scale overclouds, you have to have the right amount of resources on it too. So we need a set of extensible validations that we can build on over time to make these things easier.

That's where TripleO validations comes in. It's now officially in the TripleO codebase; it was formerly called Clapper, and it's essentially a set of Ansible playbooks that let us run various validations across the stages of a deployment. It works with TripleO, RDO Manager, and OSP Director-based deployments, and it is extendable. These are the four phases we have right now: preparation up front, pre-introspection of the bare-metal nodes themselves, pre-deployment before you actually push your overcloud, and finally post-deployment once it's up and running. There are a few dependencies: the first is Python and the second is Ansible core 2.0. So Ruchika's going to walk us through how you actually get TripleO validations installed and then a few examples of how it can be used.

Right, so you pretty much start off by cloning the Git repo, the tripleo-validations repo, and since it depends on Ansible, you go ahead and install Ansible. These are just snapshots to show how easy it is to do. Kevin mentioned resources: the undercloud is an all-in-one OpenStack and it does need a certain amount of resources to get the job done.
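Just to make those snapshots concrete, the install and a first run look roughly like the sketch below. The repo URL is the upstream tripleo-validations project, but package names, the inventory file, and the exact playbook names vary by release, so treat this as an illustration rather than the slides' exact commands:

```
# Rough sketch, not the exact commands from the slides: grab the playbooks
# and the Ansible they depend on (Python and Ansible 2.0+ are the prerequisites).
sudo yum install -y ansible
git clone https://github.com/openstack/tripleo-validations.git
cd tripleo-validations

# Each validation is a standalone playbook; run one against the undercloud.
# The playbook name and the inventory file here are illustrative.
ansible-playbook -i hosts validations/undercloud-ram.yaml

# The playbooks carry "groups" metadata marking which phase they belong to
# (preparation, pre-introspection, pre-deployment, post-deployment), so you
# can find and run everything for one phase:
for playbook in $(grep -l "pre-deployment" validations/*.yaml); do
    ansible-playbook -i hosts "$playbook"
done
```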
However simple it sounds that you need the right number of CPUs and the right amount of RAM, it's unbelievable how many times I've seen colleagues and customers, and I'm guilty of it too, conveniently skip over these checks for various reasons. We try to get to success, end up failing after many, many hours of deployment, and then come up against all sorts of errors that we have to go and debug. It's really much simpler to follow the recommendations and assign the amount of resources required by the undercloud. These are Ansible playbooks already available to check those things quickly. This one is a screenshot of the check for CPU cores; the next screen checks the amount of RAM. It's just a good idea to have this as part of your process and part of your workflow.

Yeah, the simple check up front takes a few seconds to run, but if you don't do these kinds of things you end up tracing through journalctl to find out-of-memory errors or whatever it is you're chasing down. So it's just good to do that up front.

Another example is for Ceilometer, where you want to check the default retention values. If you leave the defaults, these entries live in the database forever, and eventually your undercloud disk space fills up and you're scratching your head right when you really need your undercloud up to manage something. So it's well worth the few seconds spent to get this foundation built: a strong undercloud with some amount of validation done. Besides, these playbooks are already available, they're really easy to use, and you can even contribute to them as you go. NTP is another one. I've seen enough HA deployments fail because of an overlooked or unconfigured NTP server; it's just a good idea to run this right at the beginning.

So these are some examples of how validations can be run, how easy it is to do, and how TripleO validations can help us at the various stages: preparation, to prepare your host machine before you install your undercloud; pre-introspection, before you start introspecting your overcloud nodes; pre-deployment, before you deploy your overcloud nodes; and post-deployment, which checks things like default file descriptor values for your overcloud. The easy thing to do is look in the Ansible playbooks for the group values, check which phase each playbook is relevant for, and use that, or go ahead and add more and push them upstream.

Yeah, so it's really nice to have this extendable set of validations. I did mention earlier that some of this applies to TripleO-based deployments and some of it doesn't, but even so, they're just Ansible playbooks, so if you can utilize them in a way that fits how you deploy OpenStack, feel free to do that. They're contributable too, so let's extend them.

So once we have the validations, we have a cloud up and running, and we need to know, functionally, whether the services that are available are actually going to operate and return the results we expect. This is where Tempest comes into play. As a practice now at Red Hat, when we go out to deploy proofs of concept or production deployments, we're making it standard practice to use Tempest to validate that all the things are up and running and return the results we expect. Tempest is that functional test kit we can use.
It runs against any OpenStack cloud and it only uses the public APIs. The nice benefit of this is that we don't really care how the thing got stood up under the covers; we only care about what's exposed publicly and that it's doing what we expect it to do. There are two ways you can run Tempest. You can clone the repo itself and install it the way the docs describe, or you can install Rally and run Tempest from within Rally. Ruchika's preferred method is the latter; I'm going to let her explain why as we go forward, and then she's going to show some examples of using Tempest.

Yeah, the only reason I prefer Rally is that it makes me deal with fewer tools. I can run Tempest under the hood of Rally, and that way I just deal with Rally. So this is just a screenshot of how you install Rally and how you create your Rally virtual environment. The next screen shows how to install Tempest: the verify install step installs the Tempest test cases, and verify start runs the default list of Tempest tests. Then you can generate a nice report in the form of an HTML file using the verify results.

This is a screenshot of what successful test cases look like, and anything green, which shows success, is generally a very reassuring thing. It's good, everyone's happy. But I've noticed that the minute somebody sees a red failing test case, there's some amount of panic. I just want to use this opportunity to say: don't worry, it's pretty normal. I haven't seen a single customer where the entire default list of Tempest test cases just passed from the get-go. So it's okay.

I'll go over a little cheat sheet of how I handle failing test cases, because that's a very common question I get. In this example, this test case is failing; it seems to be an instance resize test. The first thing I do is run this test in isolation with the debug flag, which generates a whole bunch of logs that might give me some idea about where the failure is. Then I go ahead and debug. That could be multiple things: beg people, mail people, IRC chats, Google, et cetera. Eventually I figure out the problem and fix it, which in this case happened to be creating the Nova user on multiple compute nodes and having them share the same SSH key. All right, so I fix it and move on.

So this is my little strategy for it. Plan for the Tempest testing; don't leave it for the exercise you do two hours before you tell your manager that it's all done or about to go into production. Understand the failures and figure out whether you care about them. For example, I may not care about the EC2 test cases that fail, of which there are plenty. Take some time to figure out and debug the test cases that are failing. And always, at least this is what I do and find useful, run the test cases explicitly from a file. Don't run the default list; put all the test cases you care about in a file and specify that file when you want to run Tempest. That way, when you have an upgrade of any kind, major or minor, and you run against the same list of tests, you know exactly what failed and what didn't. I would go far enough to say use version control on this test file.
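As a rough cheat-sheet version of that workflow (Rally's verify subcommands and flags have shifted between releases, so check `rally verify --help` on your version; the file names here are just examples, not the ones from the slides):

```
# Point Rally at the existing cloud, then drive Tempest through it.
source overcloudrc
rally deployment create --fromenv --name my-cloud
rally verify install                 # fetches and installs Tempest
rally verify start                   # runs the default Tempest test set
rally verify results --html --output-file tempest-report.html

# Re-run a single failing test in isolation while chasing down a failure
# like the resize test mentioned above.
rally verify start --regex tempest.api.compute.servers.test_server_actions

# Once you know which tests you care about, keep them in a version-controlled
# file and always run from that list.
rally verify start --tests-file my-tempest-tests.txt
```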
Yeah, so exactly what she said. It's really funny how many times I've been at a customer doing a proof of concept, or with our consulting team doing a production deployment, and Horizon comes up and everybody's throwing the hoorays, and you're not even close to done yet. Like she said, you have to add in the time for applying these tools on top, to make sure that when you actually go to put the workloads on, which is why you spun the thing up in the first place, it's actually going to accomplish what you set out to do. Just seeing Horizon is not success.

So we've got the validations up front, which is great, and now we've got Tempest to do our functional testing. But a lot of my customers will start out at a smaller scale and want to scale up over time. They've identified a use case first, but they want to know, when they do get up to scale and as their user base grows, whether it's going to be able to handle that. I actually had a customer recently where we were doing a one-compute-node proof-of-concept deployment and we asked them how many users they were going to put on it. They said 500 users on one compute node. I was like, yeah, I don't think that's going to work out. Rally in this case would have helped us show them that it's probably not going to work.

So Rally is a benchmarking tool we can use. It answers a very simple question: how does OpenStack scale? That's a really nice question to be able to answer as we're building this up, because it's the whole purpose of it. You can use Rally to continuously improve things like your service level agreements, the performance of the cloud, and also its stability. Everybody wants OpenStack to be stable and reliable in their enterprise deployments, and you can use Rally to sort of guarantee where those limits are. So Ruchika is going to walk us through some of the key concepts with Rally, and then we'll show an actual example, which is a really nice use case that some of our customers have brought to us.

So, Rally concepts. The most important thing Rally helps you do is model your use case with some decent accuracy. There are some variables listed on this slide that show the basic things used to model the use case. There's the scenario runner, which defines the kind of traffic you want imposed on your cloud. The context specifies the context in which this traffic is imposed; in this example there's one tenant and three users. Then there's the concurrency, which tells you how much parallelism you want when you run these scenarios. Using all of these, you can pretty much model the traffic and create a good test scenario, and we'll go ahead and see how useful that is.

One of the things I find extremely useful with Rally is that I don't have to do anything extra to write a Rally test case. If I'm working towards my end-goal use case, I'm going to be writing an application template of some sort anyway, right? So I can feed that exact same application template into Rally and see what the performance looks like. This slide set shows an example of exactly that. There's a YAML file with the application stack defined, I created a scenario JSON file, and I created a Rally deployment, which in this example is defined against a number of existing users. Then I go ahead and run the scenario file against that Rally deployment.
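For reference, a scenario file modelled on that example might look something like the following. The scenario name, template path, and counts are illustrative (this uses Rally's stock Heat scenario, not necessarily what was on the slide):

```
# Hypothetical task file: one tenant, three users, stacks built from an
# existing application template, run with a constant-load runner.
cat > scenario.json <<'EOF'
{
  "HeatStacks.create_and_delete_stack": [{
    "args": { "template_path": "my_app_stack.yaml" },
    "runner": { "type": "constant", "times": 5, "concurrency": 5 },
    "context": { "users": { "tenants": 1, "users_per_tenant": 3 } }
  }]
}
EOF

# Run it against the deployment created earlier and render an HTML report.
rally task start scenario.json
rally task report --out rally-report.html
```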
So I'll walk through some of the examples and how I progressed, using the same scenario and tweaking it to see how my stack's performance looks. This is the first, really boring case, where there's just one instance, one Heat stack being created, with no concurrency at all. Things look good, everything is successful, and I get a baseline estimate of how much time it takes. Then I get to the next boring case, where I bump the parallelism up to five and create five applications, and you can easily see how the creation time changes. But it's still good; it can still work.

The next slide I just wanted to include as a word of caution. As I start bumping up the number of instances created, there's something I need to be aware of, and that's the quota. I need to remember to change the default quota values. This example shows what to look out for and is a reminder to pay attention to those details: it's the kind of error you get when you exceed your default quota allocation. So I fixed that, and, as you can see from how the scenario definition changes, I ran the same scenario again with the ten stack creations. Now you can see there's a real error showing that there weren't enough resources. So you could use this, and expand on it, to figure out how your end-goal application behaves in your stack before you have real users on it.

Yeah, so what I love about this example is that in this case it was a customer that basically had a student environment. They had a Heat stack defined already; they would deploy it for a student, the student would have access to the lab, and then they would terminate it when done. If you have one student come in, great, we can do that. If five come in, great. What if 25 students wanted to lab at the same time and we spun those all up, how is it going to respond? We learned two things out of this: first, that the quotas weren't set right on the tenant, and second, that our limit was going to be around 10 instances. So it's really useful for our customers to be able to take an application they're already working on automating and use that to benchmark against. A really nice setup.

So again, plan for it. Use the application as is if you don't want to do any extra work, generate your results, and don't forget to keep the same scenario for the next stage we're going to talk about. We have the validations up front, we have functional testing, and we have a basic benchmarking capability for our cloud. The next logical step is to aggregate that data over time and be able to tune and optimize the cloud as we go forward. That's where the newest tool in the picture, Browbeat, comes in. Browbeat is an orchestration tool for existing OpenStack workloads like Rally, so we can use Browbeat to run Rally and then aggregate the results over time to see the measurements. It's an assistant for standing up a performance infrastructure that we can then use to tune. What it is not is a new set of OpenStack workloads; nothing is being reinvented here. It just utilizes things that are already available, like the metrics being brought in and your Rally use cases. So it's a really nice setup.
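Before we move on to Browbeat, one concrete note on that quota hiccup: the fix ahead of the bigger runs is simply raising the tenant limits and rerunning the same task, roughly along these lines (the tenant name and numbers are made up for illustration):

```
# Raise the limits the defaults would trip on (instances, cores, RAM),
# then rerun the scaled-up scenario. Values here are only an example.
openstack quota set --instances 25 --cores 50 --ram 102400 student-labs
rally task start scenario.json
```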
So Ruchika is going to take us through what Browbeat is made of and what it can do for us on top of our OpenStack clouds.

So Browbeat is really a set of Ansible playbooks. It helps you create a performance-measuring infrastructure, and by that I mean it doesn't just give you information about your cloud; it gives you a holistic view of everything around your cloud, and I'll go over that in a few minutes. Anyone trying to tune anything needs to measure, visualize, change, and redo those steps a million times. I mean, poor guy or girl, whatever, but that's what happens. Browbeat helps you do exactly that, and I'll go over my setup.

My Browbeat setup consists of the cloud itself, one server with the Browbeat harness plus Elasticsearch, Logstash, and Kibana, and another server with Graphite and Grafana. Graphite is my database that collects the metric data from my cloud, and Grafana is my graphing tool. And just a disclaimer: this is not a recommended architecture. I was scavenging for hardware. It's recommended that all these pieces have servers of their own, but I had to make use of what I had, and it worked out, so it's fine.

Yeah, so in a production deployment you'd want to dedicate some resources to these tools and expand this architecture out a bit, and it'll be well worth the resources spent on it.

I forgot to mention that I had collectd running on the cloud, which sends the metric data back to the Browbeat infrastructure. Here are some snapshots, because I spent time getting this stuff working, so I thought it was worth including them for anyone who wants to use the slide set as they walk through their own setup. Make sure Grafana and Graphite are connected; this is what a good setup looks like. The next slide shows what a functional ELK stack looks like: you should have a steady stream of logs coming into it. Then you're pretty much good to go, except that you need Elasticsearch enabled, Rally enabled, and Grafana enabled; there's a small sketch of what that configuration amounts to a little further down. Now you can run any Rally scenarios you already have. You could have written them in the previous step, you could use the ones that ship with Browbeat, or anything else, but you can run your Rally scenarios in this infrastructure for the purpose of tuning your cloud.

This is an example of what Browbeat gives you. In this graph I'm showing the same test case run against two different cloud tuning parameters. The dark blue line shows it run with, say, the number of Nova API workers set to 64, and the light blue one with 16 workers. You could generate this kind of graph for whatever parameters you want to tune and care about. The idea is that you can measure how your scenario's performance changed between upgrades, between versions, between different tuning parameters; whatever your purpose might be, this is a great comparison tool.

So remember I mentioned the holistic view; this is exactly what I mean. You already got information about your cloud's performance. The Grafana dashboard gives you information about how each of your nodes is performing from a hardware perspective: RAM, CPU, network interfaces, and a whole bunch of other parameters. This slide doesn't do it justice, but I wanted to showcase that this exists, so you can see how each of your nodes is performing at the hardware level.
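To tie the Browbeat pieces together, here's a hypothetical sketch of what enabling those backends and kicking off a run amounts to. The key names follow the browbeat-config.yaml that ships with the project, but check the file in your checkout; the hostnames are placeholders:

```
# Hypothetical excerpt of browbeat-config.yaml: the point is simply that the
# Elasticsearch, Grafana, and Rally sections all need to be enabled and pointed
# at the right hosts before a run (hostnames below are placeholders).
#
#   elasticsearch:
#     enabled: true
#     host: elk.example.com
#   grafana:
#     enabled: true
#     host: graphite.example.com
#   rally:
#     enabled: true
#
# Then drive the Rally workloads through Browbeat; results and metrics land in
# Elasticsearch/Graphite and show up on the Kibana/Grafana dashboards.
cd browbeat
./browbeat.py rally
```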
So you get performance information about your nodes as well as your cloud, and you can compare. This far, we've been able to do validations, functional testing, performance testing, and tuning. The last thing I want to talk about is Red Hat Insights. This is a predictive analytics tool, fresh from the oven at Red Hat and still in beta. What it gives you is very actionable information about the stability, scalability, resource availability, and security of your cloud. It gives you a whole lot of information, along with very detailed guidance about what action needs to be taken to fix the issues it finds. What it does is run a coordinator engine on the undercloud, which runs client code on all of your overcloud nodes. All this information is securely analyzed on a Red Hat service and run against a rules engine that looks for all sorts of problems and vulnerabilities. The next slide shows a dashboard of the kind of information you get from this, and the same sort of information is available in multiple Red Hat tools. As you can see, there's an issue that it flags as an availability issue. So this is just a snapshot of what you can expect from Red Hat Insights, and you can get more information from the mailing list.

So, in conclusion, we said at the beginning that what we wanted you to take away is not necessarily all the dirty bits about how you get these things up and running and how you define your use cases, but more about which tools are available and at which times they matter. We started out with validations, which come in the form of TripleO validations for us. We have a functional test kit with Tempest, which has been around for a while, but the standard-practice use of it is what really matters: planning for it, applying it on top, and making sure that as you move forward you continue to use it so that all your services keep functioning the way you expect. Then there's benchmarking with Rally, and that use case of using an example application you're already planning to run in your OpenStack environment; using that to benchmark against over time is really, really useful for our customer base. The last is Browbeat, the newest one, which gives you the aggregated ability to visualize, measure, and tune over time. And then the cherry on top is Red Hat Insights, which does the predictive analytics, so you'll actually get information coming back to you, warning you up front when something is going to go wrong. What this leads us to, ultimately, is the high confidence we talked about and the happy clouds that our management can be proud of, instead of being hard on us for not testing, validating, and benchmarking properly over time.

So we really appreciate you taking the time. There are still quite a few other breakout sessions from a lot of our really awesome Red Hat peers today. I did want to give a quick shout-out to Yershetty, who helped us out with the Rally testing; he did a really good job with that use case and bringing it to us. And to Ruchika for her hard work. So, really appreciate it. Any questions at this time, feel free to ask away. Anybody? All right then. Thank you, guys. Thank you.