Thank you for attending. Did you all enjoy the summit? It's almost over; three days go by quickly, and now there's just one hour left. What about Barcelona? Did you like Barcelona? I myself enjoyed the food; the tapas here are great. Okay, we have a very good topic today. I am Magdi Salim, I have Muhammad Atani with me, and we're talking today about Rally. How many of you have used Rally before? Okay, so this is like a Rally 101, so it's good to keep attending the session and refresh your memory. But if you feel bored, I won't be offended. All right, so here is our great picture of the operator. How many of us here are cloud operators? Just to know. For our topic today, we have a few areas we're going to focus on. We're going to talk about the kinds of day-to-day and strategic challenges we deal with as cloud operators. We're going to talk about benchmark tools, specifically Rally. We'll take a deep dive into Rally's design, then look at how to install it and at some use cases. That's pretty much our agenda for today. As cloud operators, we have day-to-day challenges to deal with: data security, data accessibility, downtime. Do we have a DR site? Are we able to back up the data? But at a strategic level, we also have other challenges: when we establish our private cloud in the first place, as we maintain it afterwards, when we want to scale it, and when we apply patches here and there. These phases bring some worry: is my cloud still the same? Is it still operating the same way? Do I have any performance issues, and how can I detect them quickly? How can I verify that the scaling, or the patches I just applied, didn't break everything else?
And, as we know, with scaling we can hit some problems. Is the hardware our problem? Did we hit some limitation when we ordered these new nodes? Is all my traffic still going to the same node because my high availability or my load balancing is not working? Or maybe it's the developer (how many of us here are developers? I am too, by the way): did I write something in my code that is creating a bottleneck in a certain area? Or maybe it's the deployment approach I chose. So we try to figure out a way to detect our problems quickly so we can move on and reach our goal. One way to do this is to adopt benchmark tools. Besides Rally, do you use any other benchmark tools? No, not a lot. Benchmark tools are a great way to monitor the cloud and catch performance issues at an early stage. So why Rally? Rally is an open-source project under the OpenStack umbrella. It's a benchmark tool that keeps historical data between different builds. Rally is able to deploy OpenStack on multiple nodes, and it's able to verify, benchmark, and profile that deployment. Also, and this is good since we have some developers in the room: Rally actually targets developers, QA (our QA department uses it all the time), engineers, and cloud administrators. And you can configure Rally to target any number of OpenStack deployments. So far so good? So how does Rally do it? As we know, OpenStack is a huge, complicated ecosystem with multiple components. How can one tool target such a deployment and still measure it at scale? Rally has four main components plus a database. You have the OpenStack deployment engine; and if you already have your own instance, you can just configure Rally to target that instance. You have the profiling and benchmark engine.
That's where the engine creates user-like load and pushes it against the cloud. You have the verification engine, which runs Tempest and brings the data back to verify your instance. And finally, you have the reporting engine. So, what kinds of use cases do we usually target with Rally? As I said, we use it to keep historical data between different builds, which gives me assurance that my cloud is running and moving according to plan. Rally is also a great tool to identify right away the limitations of my current build or my current cloud, and knowing those limitations, I can actually use Rally to plan for growth. And with this, I'm going to turn it over to my colleague Mohammad, who will talk about Rally in more detail. Cool stuff. It's amazing how many things Rally can do: scaling the architecture, finding issues with your code, a lot of stuff. Thank you, Magdi. So I hope you're excited to learn how to install Rally now and integrate it with your existing OpenStack cloud. We're going to show you how to install Rally, how to register an existing OpenStack cloud, how to execute a task and do some benchmarking with one of the sample scenarios Rally provides, and then how to generate an HTML report. First step, installing Rally. Basically, you install Rally by downloading the installer, install_rally.sh, from the location shown here, and you just run it in your shell. After the installation is complete, you run rally-manage db create, and this will create your SQLite database for Rally. So I have a question for you: how many of you have installed Rally? A few hands. How many of you had some challenges while installing Rally? Good. That's good. For those who haven't installed Rally, I'm going to share my personal experience.
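The happy-path install just described is only a couple of commands. This is a sketch; the download URL is illustrative, since the exact location of install_rally.sh depends on the Rally release you are working with, so check the Rally documentation for the current one:

```shell
# Fetch the installer script (URL shown as an example only) and run it.
wget https://raw.githubusercontent.com/openstack/rally/master/install_rally.sh
bash install_rally.sh

# The installer alone is not the whole story: create the SQLite database
# that Rally uses to store deployments and task results.
rally-manage db create
```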
When I installed Rally, I had some challenges. First, obviously, I had to have the necessary packages. Your system might already have them, or have some but not others, so you go ahead and install the missing packages. On my second attempt, I had issues with SSL certificates: my network would not let me connect to Rally's repository over HTTPS. I tried wget; it didn't work. Then curl; same thing. In a production environment, you would check with your network or system engineers on how to allow this. But if you're just installing it to play with it and learn how it works, like I was, you can use this workaround to bypass SSL. You download the script without running it, straight from GitHub, and then you go into the file and change the git URL from HTTPS to HTTP. Now you have the installer locally on your machine. Next, you add execute permission to the script and run it. You wait for it to download and install the packages, and at the end you should see "Installation of Rally is done" in a green banner. But wait, is it really done? As mentioned earlier, it's not complete yet: we still have to create the database. So we run rally-manage db create so that Rally has somewhere to save everything. Next, we're going to see how to register our existing cloud with Rally. First, we create a JSON file and put in the parameters to connect to our cloud: the auth URL, the username, the password, and the project we use in our OpenStack environment. At the bottom, you can also set the insecure flag to skip certificate verification, or, if you have your CA certificate, add its path there instead. Next, we run rally deployment create, specify the file we just created, and give the deployment a name.
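A registration file along those lines might look like this; the endpoint and credentials are placeholders, and the field names follow the ExistingCloud format from Rally's sample deployment configs:

```json
{
    "type": "ExistingCloud",
    "auth_url": "http://keystone.example.com:5000/v2.0/",
    "admin": {
        "username": "admin",
        "password": "secret",
        "tenant_name": "demo"
    },
    "https_insecure": false,
    "https_cacert": "/path/to/ca.crt"
}
```

You would then register it with something like `rally deployment create --filename existing.json --name my_cloud`.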
In the output, in the second-to-last column, you can see that the deploy is finished. But we still need to check whether it's really complete, because sometimes it says finished when it isn't. So you run rally deployment check and you'll see the services running. For extra verification, you can run rally show images, which lists the images in the cloud you just registered, and rally show flavors, which lists your flavors. Okay, so now that we have a connection to our cloud, we're going to show you a sample task. When you install Rally, it actually ships with a lot of example scenarios that you can use as-is, or modify, to test your environment or do benchmarking. The most basic one, for me, is boot-and-delete of instances. Here you can see, in yellow, that it will boot the instance and then delete it, so it cleans up after it finishes. And we're doing it 10 times, which you can see here under the runner, where times is 10: it will boot 10 instances, apply these parameters to them, and then delete them. On the left, I'm not using force delete. On the right, in addition, I do have force delete, because I have set it to true, and I'm also assigning a network and a subnet to the instances. You'll see in the end result that the scenario on the right takes a little more time than the one on the left, because it needs time to create the network. So when I run this task, rally task start with the JSON file, you'll see that it finishes with zero errors.
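The left-hand scenario being described would look roughly like this as a task file. The flavor and image names are placeholders; the right-hand variant would additionally set "force_delete": true and attach a network:

```json
{
    "NovaServers.boot_and_delete_server": [
        {
            "args": {
                "flavor": {"name": "m1.tiny"},
                "image": {"name": "cirros-0.3.4"},
                "force_delete": false
            },
            "runner": {
                "type": "constant",
                "times": 10,
                "concurrency": 2
            },
            "context": {
                "users": {"tenants": 1, "users_per_tenant": 1}
            }
        }
    ]
}
```

You would launch it with something like `rally task start boot-and-delete.json`.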
So this is a good sign: zero errors for both of the scenarios you saw, left and right. But the average time, you will see, is a little lower for the one on the left, because it doesn't have to create any network. When the task finishes, it gives you the option to also generate an HTML report. With the HTML report, you see the same results you saw on the CLI, but in addition you get graphs, iteration by iteration, and more details. To generate the report, you run rally task report with the ID of the task, which is shown on the command line when the task finishes, and you specify where you want the output. And with this, I will hand it over to our colleague, who is going to show you the report and explain more about the parameters you saw in the sample. Perfect. Thank you, Mohammed. Hello, everyone. For those of you who haven't seen a Rally report before, I'm going to show you the report that Mohammed just generated. You might remember that he had two different scenarios within the task he executed; we have both of them here. Basically, we see the load duration, full duration, number of iterations, and whether we had any errors. I'm going to go into one of them and take a look at what Rally provides us. First, we have the load duration: this is the time Rally actually takes to execute the task, in this case 87 seconds. Then we have the full duration. This includes the creation of all the resources Rally is going to use: every time Rally executes a task, it creates a number of users and tenants specifically for that task. So this figure includes that time, plus the time to remove all the resources Rally has created.
Here we see the number of iterations again, the number of failures (none in this case), more detailed information about the durations, like the average, minimum, maximum, et cetera, and a graph with all the iterations. As you can see, there are 10 of them, each with the specific time it took. Then we have the load profile: this is the number of parallel, or concurrent, operations running at the same time. We see that we have more or less two at all times. And then a little pie chart showing us the number of errors and successes. We have another tab in the report called Details. Here we see the same graph, but with two colors: one for the nova.boot_server operation, and a lighter one for the nova.delete_server operation. And then at the bottom, the same pie chart, but split by the two steps within the task, boot server and delete server. We can see that the boot obviously took much more time: 85% of the time was spent booting the instances. And at the end, we just see the input task that we provided. Okay, so let's go back to the presentation and continue with the content. We're now going to go a little deeper into the different fields we can see in a task, and then I'm going to show you a couple of scenarios of how we can use Rally in real life. So please put on your diving goggles, because it's going to be a deep dive. I'm serious about it. Okay, task fields. We have a total of four different fields that we fill out every time we specify a task. The first one is the arguments. These are scenario-specific arguments, so they vary depending on what type of task we're executing.
So, for example, if we're doing a boot-and-delete task, we have to specify what image I'm going to use to boot the instance, what flavor I'm going to use to create it, and things like the network I'm going to boot those instances into. If, for example, I'm doing a Cinder operation, we have to provide things like the volume type or the size of the volume I'm creating. Then we have the runner. There are four different types of runners: constant, constant for duration, periodic, and serial. With constant, we have a constant load running a fixed number of times, and a concurrency parameter that lets us specify how many iterations are allowed to run in parallel. Constant for duration is the same as the one before, but runs only for a specific duration, controlled by the duration parameter. Then we have periodic: this one starts scenario runs with an interval between them, which we specify via another parameter, the period. And then serial, which obviously executes the operations serially in a single thread. The next field is the context. Basically, this defines the environment in which our Rally task is going to be executed. We can specify several things here; the main one is how many tenants and users are going to be created to execute the task. So we can say: for this specific task, I want Rally to create 20 different tenants, with one user in each of those tenants, working in parallel against my cloud to test the performance. By default, Rally creates those users, but I can also tell Rally to use existing users in the cloud if I want to stress a specific set of users. One other very important thing we can do in the context is to extend or narrow the quotas.
When you use Rally, you will see that if you want to create, for example, 100 volumes with Cinder, you're going to run into problems with the quota, because a user obviously doesn't have an unlimited quota. So what I can do in the context is extend the quota, set it to unlimited, for example, or narrow it if I want to. And the last field within a task is the SLA. Basically, Rally allows us to specify a service-level agreement within the task, and Rally automatically checks whether that SLA is breached during the execution of the task. This allows us to say, for example: for my users, the acceptable time to create an instance, delete an instance, or create a volume is a certain number of seconds. If that value is breached, flag it in the report so I can see whether my cloud is behaving as I designed it. Some of the SLA parameters I can specify are maximum and minimum failure rate, maximum seconds per iteration, and maximum average duration of the task. So now we're going to zoom in on a task and look at each of the fields individually. This is a very simple Cinder volume-create task. First, we have the args; in this case we just specify the size, which is one gigabyte here. Then we have the runner: we selected the constant type, it's going to execute 30 times, as you can see there, and we allow five iterations to be concurrent at any point in time. Then we have the context: as we said before, in this case we're going to create one tenant and 10 users within that tenant. And also, as we mentioned, we're going to modify the quota and set the number of volumes in the quota to -1, which means an unlimited quota. And then we have the SLA; in this case we specify two different SLA parameters. The first is the failure rate: we're going to allow a maximum failure rate of 1%.
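Putting those four fields together, the volume-create task being walked through looks roughly like this; the field names follow Rally's sample task files:

```json
{
    "CinderVolumes.create_volume": [
        {
            "args": {
                "size": 1
            },
            "runner": {
                "type": "constant",
                "times": 30,
                "concurrency": 5
            },
            "context": {
                "users": {"tenants": 1, "users_per_tenant": 10},
                "quotas": {"cinder": {"volumes": -1}}
            },
            "sla": {
                "failure_rate": {"max": 1},
                "max_avg_duration": 60
            }
        }
    ]
}
```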
And then the maximum average duration for a volume creation is going to be 60 seconds. Okay, now I want to talk a little about the different projects supported by Rally. As you can see, of course, all the core projects are supported, and then most of the other Big Tent projects as well. You may notice that some of them are not there: only the most mature projects are supported in Rally right now, but it's the majority of them. In this slide, I obviously don't want you to read every one of those little entries. It's just to show you that every single operation a user can do in OpenStack is already defined and coded in Rally. So, for example, anything a user can do, like create a VM, create a volume, create a snapshot, attach a volume to a VM, attach a floating IP to an instance, create a subnet: all those things are already coded, and there is a task available for you to reuse and test that specific operation from a user's perspective. Okay, so let's take a look at two different real-world use cases. The first one we're going to look at is identifying the limits of my cloud. We're going to look at an example: how many concurrent users can I have creating VMs in my system? To find out, we're going to do four things. First, create a task with a typical customer operation, in this case boot and list an instance. Second, define an SLA for the task that matches my users' expectations. Third, define a failure rate. And fourth, execute successive iterations of the task, increasing the number of VMs and users, until the SLA is breached. This is the specific task we're going to use for our test: a boot-and-list server operation. At the top you see one thing that is new today: the ability to pass parameters to Rally and make my task as environment-independent as possible.
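Sketched as a task file, using Rally's Jinja2 templating for the parameters; the default flavor and image names here are placeholders:

```json
{% set flavor_name = flavor_name or "m1.medium" %}
{% set image_name = image_name or "cirros-0.3.4" %}
{
    "NovaServers.boot_and_list_server": [
        {
            "args": {
                "flavor": {"name": "{{flavor_name}}"},
                "image": {"name": "{{image_name}}"},
                "detailed": true
            },
            "runner": {
                "type": "constant",
                "times": 25,
                "concurrency": 25
            },
            "context": {
                "users": {"tenants": 25, "users_per_tenant": 1}
            },
            "sla": {
                "max_seconds_per_iteration": 60,
                "failure_rate": {"max": 1},
                "max_avg_duration": 60
            }
        }
    ]
}
```

The successive runs then just raise times, concurrency, and the tenant count together (25, 50, 100, 200) until an SLA is breached.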
So, for example, in this case I'm defining two parameters, the flavor name and the image name. If the user doesn't specify any parameters when launching the task, I give them defaults: as you can see there, if I don't get a flavor name in the task execution, I just use m1.medium, and if I don't get an image name, I just use cirros-0.3.4. Then we have the arguments. As this is a boot-and-list server operation, I have to specify the flavor and the image name, and then there is another parameter, detailed: true. This is the type of list operation that Rally is going to execute: a detailed list operation. Then we have the runner. This is a constant runner, and I'm going to start with 25 concurrent operations: as you see, times is set to 25 iterations, with all 25 of them concurrent. So basically, all those users are going to be creating instances at the same time. Then we define the SLA, with three different parameters: 60 seconds is the maximum we allow per iteration, the maximum failure rate is 1%, and the maximum average duration is 60 seconds as well. Then the users: as we're creating 25 VMs concurrently, we want to use 25 tenants with one user each. Okay, so this is the first run that we did. As you can see, we executed 25 instances concurrently, the average time to boot an instance was 28.4 seconds, and all the SLAs went fine. We see all green: everything passed, no problems here. Our cloud is handling the workload without any problems. The second run is double that: now we're executing 50 instances concurrently. The average time has increased a little, by five seconds on average, so we now see a 33-second average time to boot and list an instance. But still, all three SLAs pass.
We have no issues with the numbers we provided for the SLA. Now we run with 100 instances, and things start to get a little uglier. Again, the average time increases, now by eight seconds on average, and one of our SLAs has already been breached: the maximum seconds per iteration is 64 seconds, which breaches our set value of 60 seconds, as we can see. Just a side note here: this was a home-lab environment, not production, so in production you should be able to go further than this. This was just put together yesterday to show you what you can do. So, yeah, again, one of the SLAs failed. And then we go to 200 instances. Obviously, in this case, the average time to boot and list an instance went through the roof, 74 seconds on average. And the maximum seconds per iteration, maybe you can't see it, is 124, double what we specified, and the maximum average duration also breached the value we were looking for. So in this case we could say that 200 concurrent instances is more than our cloud can handle. And then the second use case I want to talk about is verifying your cloud's functionality. Every time you deploy an OpenStack cloud, or you apply an update, you want to make sure your cloud is still fully functional. What you can do with Rally is create a task that contains many of the typical scenarios your users would execute: creating users, tenants, networks, routers, subnets, instances, and so on. And Rally will provide you with the output you can see here on the right, which is basically a checklist of all the things that have been done and whether they succeeded. So it really gives you a single pane of glass to see how your cloud is behaving and whether it has all the features it was designed for.
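A functionality-check task of that kind can simply list several scenarios in one file, and Rally runs them all and reports pass or fail per scenario. A minimal sketch with two typical operations, using scenario names from Rally's built-in samples and placeholder flavor and image names:

```json
{
    "NovaServers.boot_and_delete_server": [
        {
            "args": {
                "flavor": {"name": "m1.tiny"},
                "image": {"name": "cirros-0.3.4"}
            },
            "runner": {"type": "serial", "times": 1},
            "context": {
                "users": {"tenants": 1, "users_per_tenant": 1}
            }
        }
    ],
    "NeutronNetworks.create_and_delete_networks": [
        {
            "args": {"network_create_args": {}},
            "runner": {"type": "serial", "times": 1},
            "context": {
                "users": {"tenants": 1, "users_per_tenant": 1}
            }
        }
    ]
}
```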
So, with everything we have told you, I think it's a great tool that you can use. Go ahead and try OpenStack Rally today. This is some of the documentation we used for this presentation. Especially important, in my opinion, is the Rally reference, which gives you details about each of the tasks you can execute and all the parameters and fields that have to be specified in there. And thank you so much for attending. Any questions? Right, so what we do in our development teams is: every time we make changes to our OpenStack code, we execute standard Rally tests to verify that everything is okay, that no bottlenecks have been introduced in the code, and that it scales as we intended. So, obviously, Rally introduces load on the cloud, so we have to be careful when we put load on a cloud that is in production, because we can affect other users, of course. One of the things we use Rally for in our environments is longevity testing. We have modified the code slightly: by default, as we explained, Rally deletes all the resources it creates, so we removed that deletion step so that everything Rally creates stays in the cloud and keeps running for longer periods. That way we can check that our cloud has good longevity. Perfect, everyone. Thank you so much. Just before you go, we have a raffle here, so one lucky person is going to win something. I don't know what it is. One winner or two? Just one. I like this number. Zero three. We have a winner. Congrats. You can keep this; may it bring you good luck.