Hola. Good morning; it will be afternoon in a few minutes. This session is going to be on analyzing performance in the cloud. Your presenters this afternoon will be myself, Nicholas Wako, I work for Dell, Dell EMC, and Alex Cross. "And I'm a senior performance engineer with Red Hat, and I work primarily on OpenStack."

Right. So today's session is basically going to be about tools: tools for measuring performance in the cloud, and tools for monitoring performance in the cloud. Then we are going to look at the SPEC Cloud benchmark as a tool for actually measuring and monitoring performance in the cloud, and we'll also see how we can use those tools to characterize performance in the cloud. Then we'll have some tuning tips that we can bounce off of you. We'll try to make this an interactive session, but we do have a lot of material to share with you, so it looks like it might be best for us to whiz through the presentation and then take questions at the end of the session. When you ask questions, please make sure they are precise and to the point so that we can accommodate as many as we can. The other thing is that at the end of it all there is a raffle. Some of you will have picked up a raffle ticket; just make sure that you stay to the end to collect your prize. Otherwise, if you walk away, some of us might take it.

Okay. Without much ado, defining the cloud: we are definitely all advanced and expert users of the cloud, and that's the reason why we all have different definitions and versions of what we think a cloud should be, or what a cloud is. I know you all have your favorite definitions of what a cloud should be, but for purposes of this discussion we are going to limit ourselves to the cloud as defined by the National Institute of Standards and Technology, and based on that, the cloud characteristics we call out will also be those defined by NIST.
So without much ado, let's get into performance measuring tools. Alex? "Sure. So I'm sure a lot of you have heard about this tool. It's called Rally, and it's an official OpenStack project. It can run as an app or as a service. When you run it as an app, you're just running it as a CLI from wherever you install it: you can benchmark directly from your cloud itself if you install it on a controller, or from another, separate machine if, perhaps, you want the horsepower there. It provides verification, so you can verify through Tempest. You can benchmark various control-plane services. It provides some profiling, and you can generate reports through it, in HTML or in JSON as well. You can also assign an SLA to the benchmarks, and you can set up the SLA so that it stops benchmarking, hopefully before your cloud hits the brick wall, because once you hit the brick wall, obviously nobody likes having to go back and try to debug how to fix it, or how to get the cloud running again. Lastly, it's highly pluggable. The excellent thing about Rally is that you can write your own plugins, so you can use those, or you can use the plugins that are already in there as well.

The next tool I want to talk about is PerfKit Benchmarker. It's an open-source, living benchmark framework, originally open-sourced by Google. It integrates with many different cloud providers, 10-plus of them, OpenStack being one. It has many benchmarks, 34-plus, and there's pretty large community involvement there. It uses the CLI tools for those existing clouds, so you would have to install those tools on the machine you intend to run PerfKit Benchmarker from. It allows results to be published to BigQuery, and there's actually an open PR to publish to Elasticsearch as well, I saw; I think somebody has a talk here later this afternoon about that.
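As an aside on the SLA feature mentioned for Rally above: the core idea, stopping the run once results violate agreed limits, can be sketched in a few lines. This is only an illustration of the concept; the function name, tuple format, and thresholds here are made up for the example and are not Rally's actual plugin API.

```python
# Conceptual sketch of an SLA check in the spirit of Rally's SLA feature.
# Names and thresholds are illustrative, not Rally's real API.

def sla_ok(iterations, max_avg_seconds=5.0, max_failure_rate=0.1):
    """Return True while benchmark iterations stay within the SLA.

    iterations: list of (duration_seconds, failed) tuples.
    """
    if not iterations:
        return True
    failures = sum(1 for _, failed in iterations if failed)
    avg = sum(d for d, _ in iterations) / len(iterations)
    return (failures / len(iterations)) <= max_failure_rate and avg <= max_avg_seconds

# A driver would evaluate the SLA as results stream in and stop early
# on the first violation, before the cloud "hits the brick wall":
results = [(1.2, False), (1.4, False), (9.8, True)]  # third iteration failed
print(sla_ok(results))  # False: failure rate 1/3 exceeds 0.1
```

The point of evaluating the SLA continuously rather than at the end is exactly what the talk describes: the harness gives up gracefully instead of driving the cloud into a state you then have to debug your way out of.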
The other major thing about PerfKit Benchmarker is that it also captures the cloud's elasticity along with the benchmark results. Rather than just capturing the steady state where your benchmark is running, grabbing those results, and publishing that, it also grabs the run times of the benchmark itself: the time it took to provision the instances, the time it took to install the tooling it may have needed. That's nice, because obviously with a cloud you want to know what the elasticity is.

Along with PerfKit Benchmarker there's a separate project as well, also open source: PerfKit Explorer. This is a dashboarding and performance-analysis tool. It exists as an app that you can host inside Google App Engine, so in that fashion you need to have an account there and set that up. It only integrates with the BigQuery backend data store. There are multiple chart options: you can compare run to run, you can compare different flavors, you can set whatever you want on the Y and X axes. That's really nice, because you can build whatever comparison you want to build there.

Right. CloudBench. CloudBench is yet another benchmark driver and harness, just like PerfKit and the others. Basically it's a framework that automates your cloud-scale evaluation and benchmarking, and in its simplest form all it really does is initiate the creation of instances, submit configuration plans, how the test should be done, to the cloud manager, and then at the end of the test collect your logs and performance data and destroy the instances, if that is what you want it to do. At a very high level it's got three drivers: a baseline driver, an elasticity and scalability driver, and a report generator. And it's able to download the kinds of workloads you want to use for testing and measuring your cloud.

So, another tool. One to bring up here is Browbeat.
And this is a Red Hat tool; it's now under the OpenStack namespace as well. It's not an official project, but it's under the big tent. It's really an orchestration tool, so it can orchestrate multiple OpenStack workloads. It assists with installation of the workload tooling as well as a lot of the analysis tooling, such as monitoring the system performance of your cloud; it includes playbooks to set all of that up and have your entire overcloud, your OpenStack cloud, monitored. It also assists with the setup of the results-analysis side, which is Elasticsearch and Kibana. So it really combines all of those tools into one place and makes it a little bit easier for you to install all of that: it helps you with installing workloads and gathering metrics and results, all in one tool. The other big thing is that it provides a whole bunch of dashboards as well, and that's really big, because if you've ever installed something like Grafana, you'll know that you have to build all the dashboards yourself, and it can sometimes be complicated to understand how your metrics are being captured and what they really mean. It also includes the dashboards we have in Kibana for results analysis as well.

So here's an example visualization of the results we have in Kibana. In this instance we're comparing, and I realize it's pretty difficult to see what's there, but on the bottom axis we have the concurrency rate, and this is a set number of Rally benchmarks that were run against this cloud, and we're comparing UUID and Fernet tokens. Rather than overlaying the two graphs on top of each other, we just put them side by side. I found it very busy when I tried to put the line graphs on top of each other, which can be done, but you really need to limit the scope of how you want to look at your results.
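A lot of this side-by-side comparison work boils down to summarizing response-time distributions (the talk later looks at min, max, and the 50th/90th/95th/99th percentiles for worker-count tuning). Here is a rough sketch of producing that kind of summary for two configurations; the nearest-rank percentile method and the sample numbers are my own illustration, not Browbeat's or Kibana's actual implementation.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty sample list."""
    s = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(s)))
    return s[rank - 1]

def summarize(samples):
    """Min, max, and the key percentiles, like the dashboard view."""
    return {
        "min": min(samples),
        "p50": percentile(samples, 50),
        "p90": percentile(samples, 90),
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
        "max": max(samples),
    }

# Compare two runs (say, UUID vs Fernet tokens) side by side, made-up data:
uuid_ms = [12, 14, 15, 18, 22, 35, 90]
fernet_ms = [8, 9, 10, 11, 12, 15, 40]
for name, data in (("uuid", uuid_ms), ("fernet", fernet_ms)):
    print(name, summarize(data))
```

Looking at the whole spread, not just the mean, is the point: a tail-heavy p99 with a healthy p50 tells a very different story than a uniformly slow distribution.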
On the system side we use Grafana, so this is a Grafana visualization showing you CPU utilization, memory, and then per-process stats, right there on Keystone as well. That's a number of benchmarks that have been run, and you can literally see when each benchmark ran; we do have a little bit of a quiescent period, of about 5 seconds, between each one.

Okay. So the SPEC Cloud IaaS 2016 benchmark is a new benchmark that was released this year by SPEC. What it does is measure the performance of infrastructure-as-a-service clouds. It measures both the control plane and the data plane, and it uses workloads that represent real customer workloads; as you will see later on, it uses workloads like K-means and YCSB, and a host of several other workloads that are part of the CloudBench suite. It measures mainly elasticity, scalability, and provisioning time. There are a number of other secondary metrics that you can call out, but those are the three that are basically used in order to publish your results.

When you look at the way scalability and elasticity are viewed, at least from the benchmark's point of view, it's kind of like you are climbing a mountain, an infinitely high mountain. Scalability is really how high you can go; ideally, you just keep on climbing forever. And elasticity is the steps that you take: ideally, every step you take should be the same. Now, in real life we know that at some point you will get tired and you will stop, and your steps will never really stay the same. That's the same analogy in the cloud: you keep adding on, you keep loading the cloud until you start getting errors, and for elasticity you are trying to make sure that, as you add on the load, performance remains consistent. So basically that's the way scalability and elasticity are viewed by this benchmark.

What it really does is measure application instances; it has got this notion of application instances. First of all, you
have virtual machines, then you collect a bunch of those instances and make sure that they work together in a cluster to run a particular workload; that becomes your application instance. The scalability is basically how many of those application instances you can load and run successfully without errors, measuring the scalability and the elasticity of the cloud under test. Now, one thing people sometimes forget: they think it should be a measure of instance density, but it's not. It does not measure how many instances you can load on the cloud; it's actually measuring how many application instances you can load. Big difference. But you can use these SPEC Cloud application instances to individually stress the cloud under test, and we shall talk about that when it comes to performance characterization.

So the benchmark has got two phases. There is the baseline phase, where you are running just one application instance: for instance, you have a K-means application instance, you run it all on its own, with five runs. The baseline phase is used to collect the kind of statistics and data that are used in the elasticity and scalability phase. So in the baseline phase you run one single application instance, it will be K-means, and after that you run YCSB. Then, after you have completed the baseline phase, you run the elasticity phase, and that is where you start loading all these instances, K-means and YCSB, until such a point that you start getting errors, or rather the SLA, the service-level agreement, kicks in because you are not getting the right throughput and things like that.

Which really brings me to the next slide: how does it stop? At what point do you show that you cannot continue anymore? The stopping conditions are: if 20% of the application instances failed to provision, that is a flag for stopping; if 10% of them have errors, that's a condition for stopping. You can also set the maximum number of application instances; going back to our mountain analogy, you
can say that after 10,000 feet, no more. So in this case you can also say, I am going to run only 10 application instances, and after that the benchmark will stop. And then, if you have any QoS violations, in other words your K-means takes longer than the threshold to complete, or YCSB throughput is lower than a threshold, all those will be stopping conditions and the benchmark will stop.

So at the end of it all there is a report that is generated, and that report will give you the primary metrics, which are scalability, elasticity, and the mean instance provisioning time. It also gives you a host of secondary metrics, and the reasons why your benchmark stopped, and all sorts of things; all this information is available on the SPEC Cloud website. Right now there are two published results for the benchmark. They were both published this year, and they are both from Dell, and we are expecting other companies to publish results on this benchmark.

Back to measuring tools. One performance measuring tool that is included with OpenStack is the telemetry project; most of you know it as Ceilometer. It is obviously a major project. More recently, on the telemetry side, we have actually switched the back end to Gnocchi, I just wanted to point that out. Multiple applications can leverage Ceilometer at this point to review performance metrics. One thing I have found with the telemetry services we have inside OpenStack is that they are generally focused a little bit more on billing than on the performance-engineering side; the granularity I might want out of my metrics is a lot finer, essentially.

Another set of tooling you can use is collectd, Graphite, and Grafana. This is a somewhat pluggable setup, where you could potentially replace Graphite and Carbon with a different data store if you wanted, depending on how you want to store your metrics; it will just affect your dashboarding. So I just wanted to
highlight this stack. collectd is really the daemon that sits on your nodes; it's going to collect metrics, aggregate them to some degree, normalize them, and then send them off to Carbon. Carbon will store them in its cache and write them to Whisper files, as well as keeping them in the cache. Graphite is the web front end, and you can use Graphite itself without another dashboarding framework or tool, but it's difficult to use at times; it's a very busy interface with a lot of items on there. So one of the ways to make it a little bit prettier is Grafana, which I use; that's the screenshot you can see right there, our visualization of CPU, memory, as well as disk.

Ganglia has been around for some time, mainly used for hardware-centric metrics. We still use it, although the trend, especially when it comes to the cloud, has been more toward collectd, Grafana, and other tools. But Ganglia is still very relevant: it's very scalable, it's also relatively easy to set up, it tracks a lot of hardware-centric metrics, and it's got a very low operational burden.

Performance characterization. Why are we interested in performance characterization? Mainly, we want to be able to understand the behavior of the cloud under load. That's the main reason, at least at Dell, why we do performance characterization, because most of the time you may not know what happens as you stress the cloud. Some of the things we track, for instance: provisioning time; if I keep adding instances, what is the impact on performance; and particularly, if those instances are running workloads, how will they perform under load?

Provisioning time, the way we look at it: you have the instance provisioning time, where you request the cloud manager to orchestrate a bunch of instances, and the provisioning time is the time between when that request was made and when the instances are able to report, or respond to, a netcat probe on port 22. Now, for application instances, it's the time between when the request was made to create the instance and when the AI, the application instance, reports that it's ready to do business, or to run the tests. There are several tools you could use for characterizing provisioning time. One of the very good tools, the one we use, is SPEC Cloud, mainly because of the reporting facility: you can orchestrate a bunch of instances, and later on you will get a good report telling you how long provisioning took.

Now, if you want to look at the I/O characteristics, the first thing you have to do is understand what your I/O capacity is, and there are a bunch of good resources on the internet for how to do that; the two sites I have in the paper can be a good way for you to learn how to get the limits of your PCIe bus and of your SAS controllers. But basically, to use SPEC Cloud for characterization of the network and the I/O, what you can do, again through SPEC Cloud, is vary the number of threads for YCSB, or actually increase the number of YCSB records, and you can keep doing that, keep loading, until you hit whatever limit you can get to. The other thing you can do, if you are using CloudBench: CloudBench has got FIO, and a number of you may have used FIO. You can run an FIO test, and keep loading using FIO, and that should give you some good characteristics about your network and I/O. CloudBench also has netperf. And the other thing is that it is possible for you to look at your management networks and your data networks with a bunch of monitoring tools to assess how your network is actually responding: you can use Ganglia, you can use collectd, you can use Linux tools; a number of them are there, and it really depends on which one you are most comfortable with.

CPU is quite important. Now, you can also use
SPEC Cloud to characterize your CPU. The recommended way is the K-means workload. K-means is very, very CPU-intensive; you should be able to load your cloud within a very short time just by running K-means and orchestrating as many application instances as possible. All you have to do is vary the number of Hadoop slaves; you can also increase the sample size or the number of dimensions. There are lots of opportunities for you to ramp up the CPU. One of the things you will be looking at while you are doing that is statistics like CPU user time, CPU system time, I/O wait time, and IRQ time; all those statistics will be available to you, and by themselves they will give you a very good picture of what is going on within your system. You have to get the whole story together, look at your network, look at your CPU, to be able to know how your cloud is actually responding under pressure.

Now, for scalability and elasticity, that is exactly what SPEC Cloud was crafted for. It gives you a good result on what the scalability of your cloud is: how many application instances it could load, and how consistent the performance was during the loading. As you can see here, scalability is 29.5, which is a unitless score, and it shows you that you ran 20 application instances. Elasticity is about 79%, which is really a measure of how consistent performance stayed: by the end, at 20 application instances, your performance had dropped to about 72%. So that's what it will show you, and again, you can vary the number of application instances and then monitor how scalability and elasticity are impacted.

Okay, so, tuning tips: tuning the cloud. I'm sure, since you're all expert users, this is something you probably do for a living, but again, this is really for us to have a discussion; I'll just show you what we do, or what we have been doing, and I'm sure you have your own experiences. But the way we approach it, from
there, is that you have to tune your underlying infrastructure and make sure that it is really running the way it should run. You have to run with your latest BIOS and firmware revisions, and you have to use the appropriate RAID and JBOD settings. All those things are still very important, even in a cloud environment, and all the things that give you big performance at the OS and BIOS level are still very important; we always make sure those are done as part of the deployment process.

Now, this chart here just shows what we did, as a way of demonstrating the power of optimizing your cloud. What we did is run a big data workload on OpenStack. First of all, we ran that big data workload on physical servers, which in this case is number one, the blue one, and that actually served as our reference platform. So we ran the big data workload only on on-premise physical servers, and then after that we ran it on a cloud, and we kept optimizing up to the point where the performance on the cloud almost matched the performance on physical servers. By default, we started off at 0.19 of the bare-metal performance, and with lots of optimizations we ended up getting very close to bare metal. Optimizing and tuning are very powerful things that I think performance engineers enjoy doing, and they can actually make a very big difference to your performance.

The next thing we did was get the right instance configuration: we tried about five instance configurations and settled on the one that gave us the best performance. We also managed to determine how many instances per physical server we should be running, and we actually determined that if you run with four instances, and they are taking up as many of the resources as possible, that's your sweet spot. Then, having done that, we tried to see whether we could get away with
oversubscription. Oversubscription is something that is used quite a lot in the cloud. One thing we found out: actually, don't oversubscribe if performance is your key consideration. But if you have to oversubscribe, you can do it on CPU, but not on memory.

The other thing we did was decide whether we should be using Ceph shared storage or local storage. We found out that, one, if you are going to use Ceph, you have to decide what kind of replication you are going to use. Now, if you have a higher-level application which has its own replication mechanism, you might be better off turning down the Ceph replication; that's something for you to decide as a user or as an administrator, but it should only happen if you have other ways of replicating your data. In this case we were able to use the application's replication mechanism, and that actually gave us almost a 30% performance improvement, just by running with Ceph replication set to one. After that, we switched the cloud to use local storage rather than Ceph shared storage, and we got an additional 22% performance improvement; that was really one of the big game changers for us.

Then we attempted to use NUMA nodes, and just made sure our instances were aware of the NUMA topology. Well, we only got a 2% performance improvement. We could have gotten more out of it, but we didn't have enough time to research further into why we were only getting 2%; we figured it had to do with hypervisor overheads, but again, this is an area we are going to research even further. The other thing we did was disk pinning, again making sure that the instances are tied to the underlying I/O; that gave us a 15% performance improvement. After that point, our performance was almost at par with bare metal: we were at 94% of bare-metal performance, having moved from 19%. That shows you the power
of optimization and tuning: you can actually get there, and we are very confident that there are a lot of areas we can still go back and tune, and we should be able to match, if not exceed, the performance on bare metal.

So, on the control-plane side of your cloud, one of the more major issues we found was inconsistency in the tuning of your process or thread counts for workers. In this example here, we're highlighting one Keystone worker versus multiple counts, all the way up to the logical core count of this particular environment. The red bar is one worker, and the other ones are 6, 12, 18, and 24; Kibana kind of mixed up the ordering, so it's not exactly perfect, but you can just see from the red bar how dramatically different the performance looks among the worker counts. This is API response time, so lower is better, and in this particular visualization we have min, max, and the 50th, 90th, 95th, and 99th percentiles. You want to look across a wide range of percentiles, as well as the min and max, to get a feel for where the distribution of your response times is. Some of the places we've seen this: most notably Keystone, and Keystone, being your authentication process, is going to be huge, so you need to make sure that's tuned; it's called Keystone because it's the keystone of your cloud. We've also seen this with Neutron workers, Glance workers, and Nova API workers.

Another control-plane issue we've seen before has been uneven controller usage. So, the second controller in here, the middle graph, is the one that has more CPU utilization; it's actually picking up more of the jobs from Nova. One thing I wanted to highlight with this: you can start visualizing what the utilization in your cloud is in real time, so you can start running your benchmarks or your testing and get that feedback instantaneously. Using this with TripleO, on the
installation side, we look at installing the overcloud itself as well, and we've seen some significant usage by Heat. You can also see the length of time it takes to provision, where the line terminates when it finished. But the main takeaway here is that the more compute nodes we provision, the more memory utilization we see; it was about one gigabyte of memory used for every ten or so compute nodes.

Deployment timings: we tried to optimize this as well, compared to OSP 8, and you can see some of the tunings we tried and the differences in the timings. Where it shows a one, it means that that deployment failed or took too long for us to count the measurement, so we just didn't show it; it doesn't necessarily mean that it failed, it means that at that point we cut it off.

And, conclusion. You want to define what you're measuring: really figure out what metrics are most important for you, and what your objective with your cloud is. You want to use a variety of tooling, such as the tools we have here: Rally, PerfKit Benchmarker, CloudBench, SPEC Cloud, Ceilometer. You want to have tooling that measures your system metrics as well. Don't just take the results data and look at that without understanding what occurred, because if you run a bunch of benchmarks and say, hey, I know Keystone can perform like this, you've just isolated that component and you don't know how it works as a whole. So you really want to look at both sides of this: you want to look at the results data, and you want to look at the system performance as well, because if it does something totally unreasonable, like consuming all the resources in your cloud, then you know that's not really a point you can hit, because you can't support other services if they're shared among your nodes. Most importantly, though, you want to gather and analyze your data. So don't just start blindly
applying tunings: make sure you gather the data on your cloud, your environment, and your setup, look at that data yourself as well, and then apply tunings that make sense based on the analysis of that data.

Here's some additional information, and then I think we're on to Q&A now, so I'll leave that up there if you want. So, any questions? I think we do have set areas; if you have a question... oh, I thought there were some microphones there. But are there any questions? Okay, yes please.

I'm sorry, could you repeat the question? What was the performance on which... By the way, you want to stay behind to get your raffle prize. Sorry, what was the question again? What did you measure, what was the performance comparison; he's curious about how the one works. Okay. So the benchmark that we ran was a big data benchmark from the TPC. We ran it on bare-metal physical servers and got some performance; then we ran it on the cloud. The green is where we were actually running it on local storage without NUMA, and the purple is where we were running it, again on local storage, using NUMA, and we used the bare-metal performance to normalize the performance of the others. Right, so these are relative numbers rather than giving you the absolute performance. Any other question? Yes, please.

Okay, right, good. But some of these things are actually dangerous to change, and you only find out much later. One of the examples we had is that we switched off EPT support, the extended-page-table support, which in our benchmarks gave us a boost of, I don't know, maybe 10%, but later on, with the application, it suffered big time. So when you change tuning options, you have to be very careful that you don't shoot yourself in the foot later on: you have to rerun your application to make sure the application doesn't suffer in specific use cases. In our case we didn't notice this at first, because it was of course a very specific use
case, and less than 5% of the workload actually ran into it. Yeah, and other good tips: typically what we do, and this was an investigation, is that for this to be rolled into our normal reference architecture, we go through a slew of QA testing to make sure that our solution and use cases are not in any way impacted. And by the way, this was just the beginning of a series of investigations that we are going to be doing, and we had very limited time to get this done. We are likely to have another go at it, and we shall go into more detail, because we now have the experience of having gone through it, and we are encouraged by what we saw. The next time, I believe our story is going to be different; I don't think it is going to be just matching, in many cases, but showing that we could get better numbers on the cloud. Okay, any other question? Yes, and then you next.

The what? Oh, this? Meaning, basically, what we are really doing is making the instances aware of the disk topology. Again, there's a way you can configure it so that instead of using the logical disks, they are actually talking directly to the physical disks; that's disk pinning, as opposed to CPU pinning, which, again, lets you talk directly. Yes, ma'am? I'm sorry... so you're asking which tool to use. Okay, well, I'd recommend both: you want the application performance and you want the system performance, because, when I say application, if it's an API service, then you want to know... the system performance would be the CPU utilization, your memory utilization, your disk, per each of those instances; you want to know your underlying hardware, that's what I look at, the underlying hardware. Does that answer your question? I think... the tool she wants. Oh, the tool. So, personally, you're asking me a loaded question, but I would say Browbeat, and I would use the collectd, Grafana, Graphite stack for system-metrics capture and analysis.
By "analysis" it's really the end user: the tooling is really capture and visualization; you have to do the analysis yourself. Does that answer your question? Okay, all right.

Now, time for the raffle. You can get out your raffle ticket and take home some goodies. Go ahead. All right, number 67. Where is it? Hold the number up here. 67, anyone with number 67? Going once, going twice. 94? Nope. All right, did you guys get raffle tickets? 74? Nope. 89? There we go, okay, we have a winner. Okay guys, thank you so much, you've been a wonderful audience. Thank you.