Hello, and welcome to this talk about the shift left of performance testing. My name is Hari Krishnan. I am a consultant and a coach; I help companies with cloud transformation, extreme programming, agile and lean. I've spoken at multiple conferences, and my interests include high performance application architectures and distributed systems.

Before we jump into the topic, let's understand the context: why do we need to shift left on performance testing? Let's take a quick show of hands: in which environment do you identify most of your performance issues? It's not very likely that you'll identify many on your local machine, and possibly not even on development environments. Most teams I have worked with start identifying performance issues on staging and then a lot more on their production replica. And the issues that we don't end up identifying, our users identify for us, and we have to fix them. What we're showing on this chart is that the more intense the red, the longer it takes to fix those issues. That's not desirable. So why do we end up in this condition?

Let's take a look at the usual setup for performance testing. The developer writes some code on a local machine, and it goes through the development environment to staging. At this point the performance tester comes in and starts setting up the perf test environment. What does that involve? Writing test scripts, setting up the load generator, the agents, and so on. Then they start sending out load, maybe to staging, and maybe even to production. Once the tests are done, they generate a report and share it with the team. The developer makes sense of it, and then ultimately starts trying to fix the issues.

Now, what's the problem with this setup? The first, obviously, is that if you're identifying issues in staging or production, you're already quite late. What's worse is that once you identify those issues, fixing them means going through the same cycle again, which means the cycle time per issue is going to be quite high. And what makes it even worse is that the higher environments, including the perf test environment, are highly contended, because it's not just one developer; there are multiple developers, all of them trying to figure out how they can fix their respective issues. So there's a compounding effect with this setup, which is what most of the teams I have worked with have.

So how do we solve this, and how do we want it to look? Ideally, we'd like to identify most of those issues on the left hand side. Obviously, we cannot do away with performance testing in higher environments, but at the very least we'd like to identify most issues in the local environment, for all the obvious reasons we already discussed. That should be easy, right? The perf test setup is already sitting on the right hand side; all we need to do is take that same setup the performance testers had and, as a developer, start running it against the local machine and the development environment. Then we start depending a little less on the higher environments and on the performance testers, making their job a little easier. But that's easier said than done.

Let's look at the challenges with shifting left. The first one is creating a representative performance environment on the left hand side, that is, on your local machine or lower environments. What's the difficulty? The first is the production architecture itself. You have fairly complicated architectures: cloud, multi-cloud, hybrid cloud. On staging, it becomes a little less complex.
And ultimately, by the time you get to the developer environment, it's just a humble laptop. How do we equate all these environments and say that something tested here will work in a higher environment? That's not necessarily possible, right? Likewise, network topologies will differ: the latency levels, the firewalls and the architecture will be significantly different, and that's not something we can replicate in a local environment to any good extent. Moreover, perf test environments themselves are fairly complicated, with multiple servers needed to generate significant load. Trying to stuff that into a higher environment is itself pretty hard, and as for putting it on my local machine, here is a screenshot of the memory pressure on my 8GB MacBook Pro. Not desirable, is it? All of these concerns are fairly genuine. So with all these challenges, how do we shift left?

The first lesson I took away from my experiences is to scale down. You can't go to sea trials for every single design change; you have to figure out a way to scale down. But how do you scale down? Let's take two parameters, requests per second and response time. Do we scale down maintaining the aspect ratio, which is basically saying I'll reduce the RPS and also relax the response time, sort of increase it, and stuff it into the lower environment? Or maybe I'll maintain the RPS but reduce the response time? I don't know; these are all various options. So what is the way to accurately scale down all these parameters? And usually it's not just two parameters; we'll have multiple. How can we validate the performance KPIs accurately at lower scale? Honestly, we cannot. What we can do, however, is scale down the trend and invalidate hypotheses.

Let's understand hypotheses and invalidation. A hypothesis is practically a statement we make that something is going to work. That's usually what we have to assume with a lot of software and product work: we make a change and we believe that this is what is going to happen. For such a statement there is usually a verifiability aspect and a falsifiability aspect. Going back to the old school lesson on hypotheses: "all green apples are sour", and you eat one sweet green apple. You have practically invalidated the entire sour-green-apple theory. You don't have to eat all the green apples and prove that they're all sour; you just find that one sweet green apple and you have disproved the entire statement.

Let's see how that connects with software and performance testing in our context. Say I have been given the task of achieving this particular KPI: with an exponential increase in throughput, I want my response time to degrade only at a logarithmic rate. That's the expectation. How do I scale this down? While I cannot scale the absolute numbers down, the trick is to scale the trend down. Can I take the logarithmic trend and see if it holds at lower throughput? Say I test it and it doesn't hold: I realize that the response time is degrading at an exponential rate. So what I do is not bother going to the higher environments, and instead try to see if I can fix some of those issues on my local machine. Once that's fixed, maybe I'm even able to achieve better than what is expected. But that's still not conclusive enough for me to go to the higher environment and release this piece of code or feature. It's inconclusive, so I need to verify it. I do that in the higher environment, and I realize it actually holds good for some more time, but thereafter it seems to fall off. For that sort of scenario, I can always come back. But this is not all bad, right? I did identify some of the issues on the left, and only for the issues I absolutely could not confirm did I have to go to the right. That's your shift left through invalidation. One more aspect to this whole picture is that you're learning through falsifiability on the left hand side: you're understanding the solution you came up with better. Obviously we don't know for sure whether it's going to work, and we confirm that through verifiability on the right hand side.
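To make "scaling the trend" concrete, here is a minimal sketch in Python of how such a falsifiability check might look on a laptop. The endpoint URL, the load levels and the crude sequential load loop are all assumptions for illustration; a real run would use your load tool of choice. Under the hypothesis, response time should grow by a roughly constant amount each time throughput doubles, so the check only tries to catch a clearly steeper-than-logarithmic trend at small scale; it does not prove the KPI.

```python
import math
import time
import urllib.request

BASE_URL = "http://localhost:8080/greeting"  # assumed local endpoint, purely illustrative

def measure_p95_response_time(rps: int, seconds: int = 5) -> float:
    """Very rough local 'load': fire rps * seconds requests sequentially and
    return the p95 latency in milliseconds. A real test would use a proper
    load generator; this only keeps the trend check self-contained."""
    latencies = []
    for _ in range(rps * seconds):
        start = time.perf_counter()
        urllib.request.urlopen(BASE_URL).read()
        latencies.append((time.perf_counter() - start) * 1000)
    latencies.sort()
    return latencies[int(0.95 * (len(latencies) - 1))]

def trend_holds(rps_levels, baseline_rps=10, tolerance=1.5) -> bool:
    """Falsifiability check, scaled down to laptop-sized throughput.

    Hypothesis: p95 grows by a roughly constant amount per doubling of RPS
    (logarithmic degradation). We estimate that allowance from the first step
    and fail the check if later measurements blow well past it."""
    baseline = measure_p95_response_time(baseline_rps)
    first = measure_p95_response_time(rps_levels[0])
    allowed_per_doubling = max(first - baseline, 1e-6)
    for rps in rps_levels[1:]:
        doublings = math.log2(rps / baseline_rps)
        budget = baseline + tolerance * allowed_per_doubling * doublings
        if measure_p95_response_time(rps) > budget:
            return False  # degradation is clearly steeper than logarithmic
    return True

# Exponentially increasing, but deliberately tiny, load levels.
if not trend_holds([20, 40, 80, 160]):
    print("Hypothesis invalidated locally: fix the issue before moving right")
```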
Let's look at one more example. This is one of my favorites for invalidating hypotheses. In production, the baseline KPI at the moment is 80% CPU at about 10,000 RPS. Now, with a cache, we'd like to reach 100,000 RPS at the same 80% CPU. Fairly ambitious, but let's see what happens. How do I take this problem and scale it down? The absolute numbers are insanely high, right? Let's say I simply fix the CPU at some maximum level on my machine, and I hit X RPS without the cache. I add the cache and I find that, at the same CPU level, I achieve another RPS figure, Y. Now it doesn't matter exactly what X and Y are; the least I expect is that there is some meaningful difference between X and Y. Why is this important? Because adding a cache should make some difference, right? Many times on projects I've worked with, we simply added a cache and realized much later that it never was working; it just happened to stay there as moral support. At the very least we'd like to identify such issues here: see the trend, that there is a clear difference between the cache being present and not being present, and then move to the right. So that's how we evaluate hypotheses, through trends or some significant markers.
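Here is a minimal sketch of that with-and-without comparison, in Python, with hypothetical helpers: `max_rps_at_cpu_limit` is assumed to ramp load while watching CPU (for instance via container metrics) and report the highest throughput sustained under a chosen CPU ceiling, and the cache toggle is assumed to be an application config flag. The exact numbers do not matter; the experiment only checks that enabling the cache moves the needle at all.

```python
def max_rps_at_cpu_limit(cpu_limit_pct: float, cache_enabled: bool) -> float:
    """Assumed helper: restart the app with the cache on or off, ramp up load
    until the CPU ceiling is reached, and return the sustained RPS. Wire this
    to your own load tool and metrics; the signature is illustrative only."""
    raise NotImplementedError

def cache_makes_a_difference(cpu_limit_pct: float = 80.0,
                             min_relative_gain: float = 0.10) -> bool:
    """Falsifiability check: at the same CPU ceiling, throughput with the cache
    should be meaningfully higher than without it. If X and Y are basically
    equal, the cache is only 'moral support' and there is nothing worth
    verifying in the higher environments yet."""
    x = max_rps_at_cpu_limit(cpu_limit_pct, cache_enabled=False)
    y = max_rps_at_cpu_limit(cpu_limit_pct, cache_enabled=True)
    return (y - x) / x >= min_relative_gain

if not cache_makes_a_difference():
    print("No measurable effect locally: check the miss rate, TTL and eviction")
```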
The next challenge in our way is the capacity challenge: how do you fit performance testing within your sprint schedule? The difficulty is that the developer starts writing a feature and completes it, and at that point we know performance testing is going to take long, because that's how it usually is. So what we do is hand it off to a performance tester to take the feature through its paces. Meanwhile, the developer moves on to feature number two. At that point we hand that over to the perf tester as well, move into sprint two, and start on feature three. By this time the perf tester comes back saying, hey, I've identified that feature one is not meeting the expected performance KPIs. The issues are there, but we cannot really look into them because we've already committed to our sprint velocity, and so we finish feature number four. By then the perf tester has identified more issues, and by the time we are into sprint three, we are only fixing performance issues from sprint one and doing performance testing for sprint two. This is a very big anti-pattern, which we usually call the hardening sprint: we achieved some velocity by throwing pieces of work across the wall, and that has never worked in the past. It is typical of situations where the testing is not complete but the feature is called done.

So how do we go about solving this problem? The first thing is to come to terms with reality. If the performance tester and the developer collaborate, then once the feature is done, they do the performance testing together, identify the issues and then fix them. That takes two sprints, and that's the best case scenario, assuming we are able to fix the issues in the very first iteration in which we identified them. So the duration of the performance test cycle determines how long the feature takes to reach completion. What do we do? We reduce this feedback cycle. We reduce effort, we reduce complexity, we reduce repetition, and we automate all the way to resolve this problem.

Reducing repetition: how do we do that? Let's start with the API tests and the perf test scripts themselves. The developer writes API tests, the perf tester writes load tests or performance tests. The two have a lot in common: they both just generate requests; that's practically their job. A slight difference is that the API tests verify or assert the response, while for the performance tests we don't really care about the exact result, but we measure the response time. But net-net, what typically happens is the developer writes, say, Karate API tests, and the perf tester is using some other tool, maybe a code-generation tool on a completely different language stack. So that's repeated effort and duplication. The second evil that comes in is inconsistency: we cannot say that the perf test is consistent with what the developer wrote as the API test. And obviously there's a disconnect between the person writing the test scripts and the person who designed the system, so maybe we'll miss some aspects of the system which could potentially be broken. How do you solve this problem? Take a cue from motorsport: you take a perfectly good road car, chop it up, add a big engine and turbochargers, and take it racing. Likewise, take your API tests and convert them into perf tests. That's what some of the projects I've been working with have effectively been able to achieve. It helps you reduce maintenance, you are consistent by way of how you write the API tests and leverage them as perf tests, and it's a coordinated effort that promotes a lot more collaboration between the developer and the perf tester.
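In the talk this reuse is done with Karate tests driven through the Karate-Gatling integration; as a language-neutral illustration of the same idea, here is a small Python sketch where one list of request specifications backs both the assertion-style API test and a crude timing loop. The endpoint, the spec format and the helper names are all made up for illustration; the point is only that the request definitions are written once.

```python
import time
import urllib.request

# Single source of truth: the same request specs back both kinds of test.
REQUEST_SPECS = [
    {"name": "greeting", "url": "http://localhost:8080/greeting", "expected_status": 200},
]

def run_api_tests() -> None:
    """Functional mode: assert on the response, the developer's usual API test."""
    for spec in REQUEST_SPECS:
        response = urllib.request.urlopen(spec["url"])
        assert response.status == spec["expected_status"], spec["name"]

def run_perf_tests(iterations: int = 500) -> dict:
    """Perf mode: same requests, but we only record response times."""
    timings = {spec["name"]: [] for spec in REQUEST_SPECS}
    for _ in range(iterations):
        for spec in REQUEST_SPECS:
            start = time.perf_counter()
            urllib.request.urlopen(spec["url"]).read()
            timings[spec["name"]].append((time.perf_counter() - start) * 1000)
    # Mean latency in milliseconds per request spec.
    return {name: sum(ms) / len(ms) for name, ms in timings.items()}

if __name__ == "__main__":
    run_api_tests()
    print(run_perf_tests())
```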
Reducing complexity: now this is a fairly tricky one. The ecosystem around performance testing has come a long way, and we have a ton of tools available. Once you've chosen your tool, you have to pick some sort of metrics store, because we don't want to just look at a test report at the end of the run and try to make sense of it; we want to gather the application metrics and the test metrics together so that we can visualize them side by side. And for visualization you again have multiple choices: Grafana, Kibana and the lot. Ultimately, the tooling, the infrastructure, the instrumentation and the orchestration of this whole setup, putting it all together, is a fairly complicated business. Add to this constraints like cost and licensing, the perf tester's preferences and anything else, and ultimately, with great difficulty, we might arrive at one stack. Taking that stack and installing it in a higher environment is a hard enough task for the performance tester. Now, taking that and trying to shift it left onto a developer machine, which is just a laptop, leaves the developer at a loss as to how he or she is supposed to solve this problem. So let's see how we can solve it.

Wherever you have fragmentation and complexity, the simple answer that usually pops up is containerization. That's exactly what I've been doing with some of the teams: we containerize the entire performance setup so that we don't need specialized instructions for each environment. And we write, or rather develop, our perf setup exactly like we write code: we build it on the dev machine and promote it to the higher environments.

Now, taking all of the things we've spoken about so far, let's look at some code. What we did is codify all of this knowledge into a framework, and we call it Perfiz. Let's look at a live demo of Perfiz. Perfiz helps developers and performance testers collaborate by providing a common performance testing stack that runs on local machines and higher environments alike. It leverages Karate API tests as performance tests through the Karate-Gatling integration, without the need to write any Scala code; we'll cover this in a little bit. It also gathers your application metrics and Gatling metrics so that you can see them on a live dashboard in Grafana and correlate what's going on between your application behavior and the load patterns. The orchestration of all of these pieces is handled by Perfiz, and it runs inside Docker, which means you can install it on your local laptop or in higher environments alike.

All right, let's look at some code now. Setting up Perfiz is as easy as downloading the zip file and extracting it to a location of your choice. I've already done that. Now I'm going to set the environment variable PERFIZ_HOME to point to the location where I've extracted it. Done, and that's pretty much all the setup I'll need. At this point I need a demo project which already has some Karate API tests that I can leverage as performance tests. For this purpose I'm going to use the karate-demo project, which sits inside the Karate GitHub repository. I've already got the app cloned onto my local machine, so I'll boot up the application now. It has started; let's verify that it's running by hitting localhost:8080/greeting. "Hello World", so the app is running.

At this point I need to start integrating Perfiz with this application. How do I do that? I go back to my terminal. Since I already have the environment variable set, all I need to do is run perfiz.sh init on this project. I notice that Perfiz has added a configuration file, a YAML file, and another folder with some basic templates and configuration, which I'll cover in a bit. Primarily, though, it has added this YAML configuration file. What I'd like to do is take the first feature, the greeting.feature file, which is a Karate API test, and convert it into a performance test. Let's look at how we can do that.

So I open up my perfiz.yaml. Perfiz has dropped in a template. Let me walk you through this document from the top. At the top I have the features directory and then the Karate feature file. Like I said, I would like to leverage the greeting.feature file, so these path parameters help Perfiz locate the feature file and create a simulation out of it. And the main element here is karate.env, which is set to "perfiz". Why is this required? Because the karate-config has a minor change in it, which I have made; I'll show you why. Since Perfiz is going to be running inside Docker, it needs to access the Spring Boot application which is running on our localhost. For that purpose, I've created an environment and simply set the host to host.docker.internal.
So this is something you could set to whatever URL your application is running on, and it should work fine. All right, let's get back to the perfiz.yaml. That covers the first four lines. Next, I can give my simulation a more meaningful name; I'll call it "greeting", because we are testing the greeting endpoint. Beyond that is the interesting part: like I said, we don't have to write any Scala code to get this feature file to run as a performance test. The load pattern is defined right here, in a manner somewhat similar to how you would define it in the Scala DSL for Gatling. And finally, I have some URI patterns; for this particular test I don't need URI patterns to be recognized, so I'll delete that. There we have it, we are done.

The next step is to boot up Perfiz. Why do we need to boot Perfiz? Because Perfiz runs inside Docker. At this point my Docker dashboard is looking empty, but shortly the entire stack that I showed you in the slide deck will boot up right here within Docker. And done. Once it's booted up, it shows you that Grafana is running on localhost:3000. Let's check it out. Grafana is running; the username and password are admin and admin. Inside Grafana, you'll notice that Perfiz has already dropped in a template dashboard. Obviously there's no data in it yet, so let's kick off a test to see how this dashboard looks. In order to run a test, all I need to do is run the perfiz.sh test command, and it will pick up the perfiz.yaml configuration by default. Perfiz now takes the greeting.feature file and the load pattern we have defined and generates a Gatling simulation, which will run for about 60 seconds, as we have specified.

Let's look at the test results on the Grafana dashboard. At the top, you'll notice that the left hand pane shows you the Gatling metrics, which is basically the total number of requests, the OKs and the failures. On the right hand side you'll see the percentile response time distribution. Obviously you can go and modify this: the Gatling information is all available to Grafana through InfluxDB, and you can write your own queries. So those are the Gatling-related panels at the top. At the bottom I have the container metrics for the application under test, which are gathered through Prometheus. The purpose of this dashboard is to show that you can look at the test results and the application behavior side by side, so that you can correlate how your application is actually behaving as the load pattern changes. So there you have it: a complete performance test that we wired up and got running in less than five minutes.

So how does all this help with the shift left of performance testing? Let's take a look. Say you have your laptop with your API code on it, along with your Karate API tests; your deployment environment could just be localhost. That's pretty much your application. Now you install Perfiz, which runs inside Docker. You configure it through the perfiz.yaml and let Perfiz know how to generate the load; it generates the performance tests out of your Karate tests, gathers the metrics from your local application and from Gatling, and presents them on a dashboard. Now you are able to analyze the results and take immediate action on your local laptop. Once you're satisfied that your performance testing is, to a reasonable extent, complete on your local machine, you promote your application to the higher environments, and likewise you can promote Perfiz to the higher environments, since it is Dockerized.
There's not much specialized setup required in the higher environments. Perfiz pretty much leverages the same configuration files, through a pipeline if you wish, runs the load tests against the higher environments, gathers the metrics and makes them available to you for analysis. This is the longer cycle. So on the left hand side you do your performance testing right on your laptop and identify most of the issues; that's the shift left. And only for those verification scenarios do you have to move to the right.

While we have discussed many things, the largest challenge, in my opinion, still remains the mindset: we need to move from a performance testing thought process towards performance engineering. What do I mean by that? Because of the word "testing", we seem to treat it as a verification activity that comes pretty much at the tail end of development. Instead, it should probably be treated more like a learning activity, a series of spikes through which we learn about the decisions we're making and avoid guesswork. All in all, what we would like to strive towards is to become more scientific about how we make architectural decisions, or even something as small as adding a couple of lines of code. Are we applying the right amount of rigor, designing experiments and trying to understand what's going on?

So can we have some sort of template into which we can put all of the ideas we discussed through this talk, and which nudges our thought process in the right direction? That's what I call the continuous evaluation template. It simply captures your problem statement, the baseline and target KPIs, and your hypotheses to start with. Then come the important pieces, which are your designed experiments; in particular, you split them into falsifiability and verifiability experiments, in order to quickly invalidate hypotheses wherever possible. Finally, you record the validated learnings that come out of these experiments for future reference.
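As a rough illustration, and not a prescribed format, the template could be captured as simple structured data. Here is a Python sketch with made-up field names, pre-filled with the baseline and target from the throughput example that follows.

```python
from dataclasses import dataclass, field

# Made-up field names; the template itself is just a structured checklist.
@dataclass
class Experiment:
    description: str
    kind: str            # "falsifiability" (run left, on the laptop) or "verifiability" (run right)
    result: str = "pending"

@dataclass
class ContinuousEvaluation:
    problem_statement: str
    baseline_kpi: str
    target_kpi: str
    hypothesis: str
    experiments: list = field(default_factory=list)
    validated_learnings: list = field(default_factory=list)

evaluation = ContinuousEvaluation(
    problem_statement="Improve API throughput",
    baseline_kpi="10K RPS at 80% CPU",
    target_kpi="100K RPS at 80% CPU",
    hypothesis="Adding a cache will get us to the target throughput",
    experiments=[
        Experiment("With vs. without cache at a fixed CPU ceiling on the laptop", "falsifiability"),
        Experiment("Scale the trend: 10x over the staging baseline", "verifiability"),
    ],
)
```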
Let's take a simple example to run through this template. I will again leverage the example we saw earlier, improving the throughput of the application; here we have a target to take it from 10K RPS to 100K RPS. The immediate thought that occurred to me when solving this problem was: why don't I add a cache? That should help speed things along. The first falsifiability experiment I came up with is that adding a cache should create some sort of change, right? You remember the with-or-without, A/B-style experiment: at the very least, I need to see that difference between X and Y. That's the falsifiability piece. If this itself fails, adding a cache makes no difference and I don't need to bother verifying it.

Say it does fail. I investigate the issue and realize the miss rate is 100% on my machine, because of which I'm not able to see any discernible difference in the behavior of the application. I further hypothesize that the miss rate is 100% because the TTL is too small. So I fix the TTL issue, try the experiment again, and this time it works. So I move towards verification in a higher environment, let's say staging, and the strategy I'm going to leverage now is to scale the trend: I see that a 10x improvement has been requested, so I'm going to try for a 10x improvement on top of whatever the baseline KPI for the staging environment is.

Now, if we cannot achieve that, we investigate further, and we realize the miss rate and eviction rate are still fairly high. One hypothesis we have is that the eviction rate is tied to the cache size. How can I probe this on my local machine? Again, falsifiability: if I cannot prove that increasing the cache size will reduce evictions, can I at least prove that I can introduce evictions by reducing the cache size? That is the experiment. If I can show that, I can establish that there is some sort of relationship between eviction rate and cache size. Now I repeat the verification experiment and it works, which means the validated learning has put me in a position where I have some meaningful recommendations for how the production environment should be set up. Before I go too far, I first verify that deploying this change at least changes the CPU load under current conditions, and that we see a cache hit rate, just to establish that the deployment has gone through. That works, so we can go for a full-blown performance test on the production replica to see if we can hit our target KPI of 100K RPS at 80% CPU. That also succeeds, and the initiative is done. Of course, I've simplified this template for the purposes of demonstrating it here; the real list was a lot longer. But you get the idea: this template has constantly put me in a position where I have to think about shifting the problem left by designing falsifiability experiments, and to depend less on moving to the right.

So with that, I thank you all for being a very kind audience, and I hope to stay connected with everyone through any of these mediums. Let's move on to the questions and answers.