I'm Siva Balan. I work at GE Digital, and I'm a performance engineer for the security services offered by the Predix platform, which is developed at GE Digital. I'm here, as the slide says, to talk a little bit about how to do performance engineering and testing of services and apps developed on Cloud Foundry. This is basically the practice we've put in place at GE Digital, because we develop our services on top of Cloud Foundry. I thought I'd share it with you, and if it helps even a couple of people, I've met my goal.

Let me briefly walk you through the agenda. We're going to talk about why you should go through the hassle of doing performance engineering or testing in the first place, and we'll look at some tools and techniques we follow at GE Digital. It's all pretty common stuff: we use open source tools, nothing proprietary to us, and anyone can pick this up and start using it. Then I'll talk briefly about how we've integrated performance testing with our CI/CD pipeline, and about how different performance testing is for apps deployed to Cloud Foundry versus apps running on-premise. I've seen customers who are migrating from on-premise applications to cloud services, and the mindset for performance testing on-premise applications is quite different, especially when it comes to Cloud Foundry. I'm not going to talk about other offerings, just Cloud Foundry, because that's the migration we went through ourselves, and we came away with some interesting findings that I'll share. Then there's a slide on key takeaways, and we'll wrap up with questions.

So why go through the hassle? If you've done performance engineering, or even talked to performance engineers, you know that non-functional issues are very hard to find. It's not like you run a test and get results right away. Some take a few hours to find, some take a few days, especially issues like resource leaks and memory leaks. In the software development life cycle, for most developers and project managers I've spoken to, performance testing is usually a checkbox: you say, okay, I've done my performance testing, and it's ready to go into production. It's usually the last step in the process. But what we've found is that treating it as a checkbox right before release makes the issues very difficult to fix. Starting early helps catch issues early in the software development life cycle, especially if you're following an agile methodology. At least in our experience, if you start your performance testing in sprint two or sprint three, it makes a big difference in the amount of time developers take to fix an issue and in the quality of the fix. And let me tell you, non-functional issues are typically not fixed the first time. The issue goes to a developer, he fixes it, you run the test again, and you'll almost certainly catch the same issue or a similar one. So catching issues early and running the tests as part of a CI/CD pipeline gives developers more time to fix them, and it really helps.
Also, I'm not sure how many of you have woken up at odd hours to fix issues in production. I've done it, and this really helps in not having to wake up at odd hours.

On to the tools and techniques we use. Let me start with the types of tests that we do. We start with a capacity test, where we deploy the app or service to one instance on Cloud Foundry and tune it for the most optimal performance in that one instance. Take a Java application as an example: we look at the heap, we look at its usage, we look at the garbage collection patterns, and we increase or decrease the heap or the container size depending on how often garbage collection takes place. We look at various parameters and tune the application to the most optimal performance for that single instance.

Once we nail that down, we move on to the scalability test, where we scale out the number of instances and increase the load proportionally. Say one instance of your app can handle a thousand requests per minute at around 70% utilization, with a nice graph for your garbage collection. You then move to three instances, increase the load to 3,000 requests per minute, and you should see your response times stay flat while your throughput increases proportionally to the number of instances you've scaled to. We typically scale to three, five, or ten instances depending on the requirements of the service.

The endurance test is essentially running one instance, or maybe three or five instances, depending on how you want to test; whether it's a single-instance or multi-instance test doesn't matter, but we typically run it for five days. What this catches is memory leaks, resource leaks, and CPU utilization problems; it tells you if there are objects being allocated and never removed from the heap. All those issues are typically caught in endurance tests. I can't stress enough the importance of running endurance tests, because literally every release introduces some bug that causes a memory leak or a resource leak. And the fact that you're running on Cloud Foundry hides all these issues behind the scenes; I'll show an example of exactly what I mean. Say you have a memory leak in your application and it runs in production. It runs for two days and the container crashes, but Cloud Foundry helpfully restarts the container without even letting you know there was a crash, and you never realize there's a resource leak in your application. The only way to detect this is to run these tests for a longer period of time and verify that resource utilization stays healthy throughout. I'll show you a finding we had on exactly this kind of memory leak issue.

Then there's the stress test. This is something we do to check how your application recovers after a spike. We typically see spikes early in the week, or sometimes early in the mornings, and auto scaling is unfortunately not available out of the box in Cloud Foundry yet.
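As a concrete example of the scaling step, the instance and memory knobs are driven by the CF CLI's scale command. A minimal sketch, with the app name and sizes as placeholders:

```
# Capacity test: one tuned instance with a fixed memory allocation
cf scale myapp -i 1 -m 2G

# Scalability test: triple the instances, then drive 3x the load from
# JMeter and check that throughput scales proportionally
cf scale myapp -i 3
```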
So if your application is set to run with three instances and it's running out of resources because of a spike in the number of requests coming in, it's going to go down at some point. You'll get alerted that there's a problem with your application, you'll go increase the number of instances, and that fixes the issue. But you want to make sure, or at least be aware of, how your application recovers once the number of requests goes back down. That's what stress testing helps with.

Chaos Monkey testing is something we've introduced recently. This came out of Netflix, as some of you might be aware. Essentially, we randomly try to remove services or instances from Cloud Foundry. It's difficult to target a particular instance in Cloud Foundry and destroy it, so what we do instead is scale down while there's a request spike and watch how the application handles instances going up and down. Ultimately we want to do this in production; we're not there yet, we're still doing it in the performance test environment to see how it works, but we're getting there slowly.

Some of the tools we use: JMeter on Docker for load testing, New Relic for monitoring, and Jolokia for JMX, which I'll talk about as well. Unfortunately, in Cloud Foundry the only ports exposed are HTTP ports, so you can't use traditional JMX tools like JConsole to look at JMX metrics. Jolokia is an open source tool that gives you an HTTP wrapper on top of JMX, so you can use a curl request to read your JMX metrics. And we use the ELK stack for persistence; I'll talk about that too.

This slide gives you a picture of how we typically do our performance testing. When a developer checks code into the developer or master branch, the performance test framework checks out the code, builds it, and pushes it to the performance test environment on Cloud Foundry. Once that's done, a job kicks off one or many Docker containers, depending on how many instances of JMeter you want to run, and these start pushing load to Cloud Foundry. Every app that gets pushed to Cloud Foundry is always bound to New Relic for monitoring purposes, because without it we're just flying blind. We also have Jolokia which, as I said, is an HTTP wrapper over JMX, and we actually call it using Logstash: Logstash polls the apps in Cloud Foundry every 30 seconds, gets the metrics, and persists them in Elasticsearch, as you can see. Sorry, this was the picture I was going to show. As you can see, the Jolokia results are persisted in Elasticsearch, and the JMeter results also go through RabbitMQ and get persisted in Elasticsearch. Essentially, Logstash is a subscriber to RabbitMQ and the publisher is JMeter, so the JMeter results go to RabbitMQ, and RabbitMQ is the messaging bus we use so that we can handle multiple test runs at the same time. And of course New Relic is constantly monitoring your Cloud Foundry deployments.
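A minimal sketch of what that Logstash pipeline might look like, assuming the standard http_poller and rabbitmq input plugins; the hostnames, queue name, index name, and MBean path are placeholders, not our actual configuration:

```
input {
  # Poll each app's Jolokia endpoint every 30 seconds for JMX metrics
  http_poller {
    urls => {
      heap => "http://myapp.example.com/jolokia/read/java.lang:type=Memory/HeapMemoryUsage"
    }
    schedule => { every => "30s" }
    codec => "json"
  }
  # Subscribe to the JMeter results published to RabbitMQ
  rabbitmq {
    host  => "rabbitmq.example.com"
    queue => "jmeter-results"
    codec => "json"
  }
}
output {
  # Persist everything to Elasticsearch for the Kibana dashboards
  elasticsearch {
    hosts => ["http://elasticsearch.example.com:9200"]
    index => "perf-results-%{+YYYY.MM.dd}"
  }
}
```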
We typically run these capacity tests on a daily basis, the results are persisted, and a typical graph looks like this in our case. This is a dashboard we built using Kibana. The first two boxes, transaction response time and transaction throughput, come from JMeter, and all the other boxes, like memory used and file descriptors, are JMX metrics collected using Jolokia and persisted in Elasticsearch. So we're now able to combine the JMeter results and the Jolokia results into one graph. And because you have the capability to get at JMX, you can publish custom JMX metrics into this dashboard as well. Again, it's Logstash, Elasticsearch, and Kibana, all open source; there's nothing proprietary about what we do here.

CI/CD integration is something we take very seriously. We use Jenkins for our CI. The way we've integrated performance testing into the CI/CD pipeline is that Jenkins gets the code from GitHub and builds from our developer branch, which is always the most current branch we test against; that's where we get the latest bits. Each of the Jenkins slaves has Docker installed on it. We run performance testing as a nightly job, because a performance test typically takes two to three hours to run and we don't want to do that for every check-in. So whatever the last build of the day is, we check it out, build it, push it to the performance test environment, and run the test for at least three hours. The results of those tests get persisted in Elasticsearch, and of course New Relic monitors the tests as well. We also have alerts set up in New Relic, so if there's an SLA breach in any test run, we get notified. The reports are also emailed on a daily or weekly basis to all the stakeholders. That way everybody is aware of what the performance test runs are, what the results are, and whether there's any performance regression. We don't have that from build to build, because we don't run it for every build, but at least day to day the developers are aware of it. That's how we've built performance testing into our Jenkins continuous integration.

Now, how different is troubleshooting in Cloud Foundry? Because we moved our application from on-premise to Cloud Foundry, we got some insight into how troubleshooting differs between the two environments. As I mentioned, there's no access to RMI ports in Cloud Foundry, unfortunately, so we had to come up with a way around that, and that's how we started using Jolokia. Jolokia gives you an HTTP wrapper for JMX, and you can do all the JMX operations through it, which has helped us tremendously in collecting JMX metrics. Most metrics are available by default in New Relic, but some, like DB connection pools and other custom JMX metrics, are not, and you have to use Jolokia to get those values.
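Going back to the CI hookup for a moment: here is a minimal sketch of what such a nightly job might look like as a declarative Jenkinsfile. The agent label, build command, app name, test plan, and the community JMeter image are all placeholders, not our actual setup:

```
// Nightly performance-test job, sketched as a declarative pipeline
pipeline {
  agent { label 'docker' }           // our Jenkins slaves have Docker installed
  triggers { cron('H 0 * * *') }     // nightly run; a full test takes ~3 hours
  stages {
    stage('Build') {
      steps { sh './mvnw package' }  // build the last check-in of the day
    }
    stage('Deploy to perf env') {
      steps { sh 'cf push myapp-perf -f manifest-perf.yml' }
    }
    stage('Load test') {
      steps {
        // one or many JMeter containers, depending on the load needed
        sh 'docker run --rm -v "$PWD/tests:/tests" justb4/jmeter -n -t /tests/capacity.jmx'
      }
    }
  }
}
```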
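And since Jolokia's HTTP wrapper keeps coming up, this is roughly what a read request looks like. The hostname is a placeholder; /jolokia/read/... is Jolokia's standard REST read operation:

```
# Read current heap usage over plain HTTP -- no RMI port required
curl http://myapp.example.com/jolokia/read/java.lang:type=Memory/HeapMemoryUsage
```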
And of course, as I said before, we persist those values in Elasticsearch and we're able to visualize them.

JVM crashes. If there's a JVM crash, due to running out of memory for example, it takes the container down with it. On an on-prem machine, you can go back and look at your hs_err_pid file, if you're familiar with that; it usually contains a stack trace you can use to figure out what caused the JVM to crash. But because Cloud Foundry containers are ephemeral, you can't go back and figure out what caused the crash. So you rely on application logs, which in our case are all persisted to Elasticsearch as well, or on the CF CLI, which gives you the cf events command; I'll show how we use that too. Those are a couple of ways we've found to work out what happened with a crash. Again, traditional tools like jVisualVM or JProfiler don't work on Cloud Foundry, so we rely on APM tools like New Relic or AppDynamics to monitor the application during a performance test, and in production as well, to get good insight into what's happening with the application and how resources are being used. There are ways around all of this; it's not easy, but it really works, and over the past couple of years we've figured out how to make our lives easier with these tools.

Now let me talk about an interesting finding from our move from on-prem to Cloud Foundry deployments. Most of our apps are Java applications, and one of them is a Java Spring Boot app. When we deployed this Spring Boot app and started our daily performance testing, we found that the container would start crashing within a couple of hours; you'd see restarts happening constantly. We checked our code up and down, every bit of it, and didn't find any leaks. If you ran the same test on an on-prem machine, it worked fine. We deployed it on our laptops, ran the tests, no leaks; it worked perfectly. The leak happened only after we deployed to Cloud Foundry. As we looked through the JMX graphs and the New Relic graphs, one thing we noticed was that from the time we deployed the app to the time it crashed, there was not a single full garbage collection. That was pretty interesting. Why would garbage collection never kick in? It led us to believe that something was crashing the container before the JVM decided to do its first garbage collection. After a little investigation, we found that the app we were testing was fairly heavy on native memory usage. Let me go back a little and show you a graph of how these things look. Okay. This is a graph that looks normal: if you look at the heap memory usage here, you can see very nice garbage collection happening. This is the last three hours, and it's actually a test running right now.
Over the last three hours, you can see it has done a couple of garbage collections here. Now I'll show you another graph where it crashes. Oh, not this one. Where's the other one? I think I lost the graph; I'll pull it up a little later rather than waste time hunting for it. Essentially, what you'd see is no garbage collection happening at all. When that happens, you have to adjust the memory profile. Let me go back to that.

All right. What we found was that the memory heuristics allocated a percentage of memory to the heap that was very close to the total size of the container, 70 or 75% by default, I think, with only 10% of memory allocated to native. And the native memory was using a lot more than it was allocated. So when the app reached the point where it had to do a garbage collection, there was no container memory left; the container was running out of memory, not the heap itself. What we did was use a memory limit variable, if you're aware of that, which you can set. For example, if we push our app with two gigabytes of memory, the memory heuristics give 70% of the memory to the heap, 10% to native, and 10% to the metaspace, and that's how the total two gigabytes gets divided up. By setting this memory limit to, say, 1.3 gigabytes, I'm telling the memory heuristics to work against those 1.3 gigabytes and not against the entire two gigabytes. That leaves a lot less memory allocated to the heap and a lot more available for native memory. And that got us to the point where we started seeing those nice garbage collections happening; once garbage collection started happening, the app just ran fine for more than five days with no issues.

Let me show you quickly. Let me escape out. Here it is. Okay. At the top you can see all the crashes that happened when I removed the memory limit variable and ran the test for a couple of hours, and then I set the memory limit back and you can see the test started running fine without any issues. This was one interesting finding, and we could find it only because we were running the performance tests in our CI/CD pipeline; we were able to detect it and get the memory limit in place well before we deployed to production.

Let me come back to one more interesting finding, quickly. Again, it's a Java Spring Boot app, and the container would crash after 12 hours. We were running an endurance test, and after 12 hours we'd see the container crash. This time we did have full GCs, but what we found was that the interval between full GCs kept getting smaller and smaller, until eventually the JVM was continuously doing full GCs and never had time to do scavenge (minor) GCs at all. This is what a typical memory leak looks like; if a Java app has a memory leak, this is typically how it presents.
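As an aside, the crashes on that dashboard also show up in the cf events feed I mentioned earlier, which survives the container itself; each crash is listed with an exit status and reason. Assuming an app named myapp:

```
# List recent lifecycle events (crashes, restarts, scaling) for an app
cf events myapp
```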
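And to make the memory-limit fix from the first finding concrete, here is a minimal sketch of how the override might look in a manifest. This assumes the buildpack derives its heuristics from a MEMORY_LIMIT environment variable, as ours did at the time; treat the variable name and the numbers as illustrative, not as our exact configuration:

```
---
applications:
- name: myapp
  memory: 2G          # actual container size
  env:
    # Tell the buildpack's memory heuristics to size the heap and
    # metaspace against 1.3G, leaving ~700M headroom for native memory
    MEMORY_LIMIT: 1300m
```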
So, coming back to the leak pattern: you'd see a continuous pattern of full GCs, the intervals getting smaller and smaller, until eventually the JVM is doing nothing but full GCs. And of course we didn't find any leaks in our Java code. But when we started looking at the services bound to the application, it turned out the agent we were using to monitor for leaks was itself the culprit causing the leak. When we unbound the New Relic agent from the application and reran the test, it ran fine for seven days without any problem as soon as we removed it; luckily, we still had the JMX metrics collected through Jolokia. We then reported back to New Relic to figure out why the agent was causing the problem, and there was one particular version of the New Relic agent that was causing a memory leak. None of this could have been identified if we weren't running those performance tests.

Some key takeaways for running performance tests. Start early in your sprints. Go deep; running one test and checking a box is probably not helpful. Always use a good monitoring tool, because without one you're literally flying blind. As far as possible, automate your tests with CI and run as many tests as possible. Enable developers to run performance tests on their own if you can, which is something we're trying to do at GE. And make the test results accessible to every developer, because they need to see how their code is performing; that makes a big difference. That's something we've learned as well. So that's it. I think I have about three minutes for questions. If you have any, I'd be more than happy to take them. Yes, please.

[Question about where the Jolokia agent runs.] The Jolokia agent actually runs in the application itself. Since it's a Spring Boot app, it's a POM dependency; we just have Jolokia in the POM, so when we deploy the application, it gets included as part of the application. Yes.

[Question about the memory limit value.] It's totally dependent on what application you're running. For our application it was 1.3 gigabytes, but your application might be much more native-heavy, so you may have to reduce it even more. We started off with one gig and slowly increased to 1.3 gig, where we found the sweet spot, and we said we'd stick with 1.3. We didn't want to go to 1.5, because we started seeing memory issues again at 1.5, so we scaled back and decided 1.3 was a good number for us. It really depends on your application; there is no one magic number. Yes. Sorry, I think he was first.

[Question about DEA versus Diego.] Oh, we're still on DEA, I'm sorry; we are moving to Diego. But I was told we're not going to be given SSH access in production on Diego, for security purposes, so I don't think that's going to work for us even on Diego. For testing, yes, that's possible; I think it might be enabled there. But at some point we found that unexpected things happen in production too, so I personally wish we had SSH access to it. We'd get it for performance testing, though. Yes.

[Comment from the audience.] Yeah, we experienced pretty much the same thing. But do you see an option? You can go change it yourself. What we have done, at least for performance testing, is fork the buildpack: I forked the buildpack and changed the memory heuristics setting for the OpenJDK JRE. There are some things you can do there and some things you cannot.
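To make the earlier Jolokia answer concrete, a minimal Maven sketch; the version number is a placeholder, and this assumes a Spring Boot 1.x app, where having jolokia-core on the classpath exposes Jolokia over HTTP through the actuator:

```
<!-- Jolokia ships inside the app itself: add it to the POM and the
     buildpack deploys it along with everything else -->
<dependency>
  <groupId>org.jolokia</groupId>
  <artifactId>jolokia-core</artifactId>
  <version>1.3.7</version> <!-- placeholder version -->
</dependency>
```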
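And on the buildpack fork: in the Cloud Foundry java-buildpack, the defaults in question live in a YAML config, something like config/open_jdk_jre.yml. This sketch shows the kind of change involved, with the percentages as illustrative values rather than our actual numbers; newer buildpack versions also let you override these per app via the JBP_CONFIG_OPEN_JDK_JRE environment variable in the manifest instead of forking:

```
# config/open_jdk_jre.yml (forked java-buildpack): shift memory away
# from the heap to leave room for native allocations
memory_heuristics:
  heap: 50        # default is ~75; our app needed far less heap
  metaspace: 10
  stack: 5
  native: 35      # give native memory the rest of the container
```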
And I think the memory heuristics were one of the things we couldn't do from the manifest file; we had to go change the buildpack for it. Oh, maybe that's changed. Okay. But I'm not sure that's a buildpack problem; it's probably something you have to control on the application side. The buildpack basically tells you how much memory it's going to give to your application. It doesn't care how much native memory or how much heap memory your application is actually using; it doesn't control how much memory your application uses. All it controls is the memory it's going to hand your application to use. Yes, that's correct. Yes.

I believe there is a way to... the only way I can think of is to come up with your own heuristics numbers, because they do give us the option to change them. So I think we should just go change them if we feel the application isn't performing well with the default numbers provided as part of the buildpack. I can't think of any other way to make it better at this time. So, yeah. Thank you. Sure. Any more questions? All right, I think we're out of time. Thank you. Oh, yeah, sure, I'm here; you can come up and ask.