Hello. Welcome to this session of the Art and Science of Optimizing Cloud-Native Java Applications. I'm Siva Balam. I work for GE Digital; I'm a performance engineer with the security team there. Before we start, I think we have to go over the fire code announcement, and here it is. I'm sure you have it memorized by now, so let it sit there for a couple of seconds.

In this session: we've been running quite a few Java services on Cloud Foundry for the past three years, and we have learned a lot about how to deploy and how not to deploy, how to run and how not to run, Java applications in the cloud. So this session is essentially to give you some idea of the lessons we learned about optimizing our Java applications while running them on Cloud Foundry.

Let's dive into what we're going to look at, starting with a brief introduction to what Cloud Foundry gives us. It makes it very easy to deploy our applications; we all know that. It makes dependency management a breeze as well, so we don't have to worry about missing dependencies when the application is coming up or when it's actually running in the cloud. It makes it very easy to monitor, and I'll talk a little later about how easy that is. And it makes sure the app lifecycle is taken care of: health checks are handled for you, and if your application crashes, it brings it back up quickly and effectively. The health checks and restarts are done automatically. So that's just background on what Cloud Foundry offers us for deploying Java applications as cloud-native Java applications.

Now, how do we know that your application is optimized to run in the cloud? Here are some of the things, I would call it a checklist, that we typically look for. How is it performing in the cloud? Is it receiving the requests it's supposed to? Is it processing them in the right amount of time? What does the response time look like? How are your containers behaving while the application is running? And how is your garbage collection? Especially when you run Java applications in the cloud, one of the key things to look for is how effective your garbage collections are; further down in the slides, I'll show you a couple of very interesting scenarios in terms of garbage collection. How are the connection pools managed? When I talk about connection pools, I'm specifically referring to database connection pooling; I'm sure most applications talk to some type of database on the back end. Managing connection pools and the database was a big challenge for us, and I'll tell you why. How is your thread pool managed? When you talk about Java applications, you have to talk about thread pools: how do you effectively manage them, and how do you optimize them to run in the cloud? And how are the container resources used? Developers especially don't give a lot of thought to how container resources are managed when they deploy their Java applications to the cloud. From a performance engineer's standpoint, one of the things I look for is how effectively those resources are managed: what is the CPU utilization on the containers, and how is memory utilized? We don't have to worry much about disks, because these are ephemeral disks and we typically don't store anything on them.
But memory and CPU are things we need to watch out for. Let's take them one by one.

In terms of app performance, first and foremost, I always say to every service owner: never fly blind. Always, always have a monitoring tool in place to monitor your service. You can pick anything of your choice, New Relic, AppDynamics, Dynatrace; there are quite a few tools available that easily let you bind the monitoring tool to your service on the platform, and they have service brokers available as well. So please make sure your service is monitored right from the time it's deployed.

And we have to collect some key metrics. This can either be done through the monitoring tool, or you can have your own scripts, services, or apps that collect the metrics for you. Some of the key things I would suggest are heap usage (how is the heap being utilized by your application?), the garbage collection pattern, and thread usage: how many threads are in use, how many are active, how many are idle? These are key metrics you need to collect irrespective of whether your application is running fine or not; I would say just start with these basics. I'll show a small sketch of reading these metrics in-process in a moment.

Now, just to give you a background on what garbage collection should and should not look like: this is a screenshot from one of our services. Which one do you think is more optimal, the one on the left or the one on the right? I would say the one on the left, right? It's essentially the same service, but with different buildpacks we noticed different garbage collection patterns. The sawtooth pattern on the left is the optimal one: it's collecting garbage nicely, and the garbage collection kicks in at the right time, at almost the same frequency every time. On the right, you can see the committed heap slowly going down, which is a clear indication of a memory leak in the application. You can also see garbage collection happening at more frequent intervals, which tells you it's collecting fewer and fewer objects each time. These are things to watch out for when you deploy applications to the cloud.

And comparing the slide we saw before with this one: the full garbage collection happens very periodically in the first, but in the second it occurs more often and also takes more of your CPU cycles. You can actually see garbage collection expressed as a percentage of CPU utilized. You don't want your application to look like the one on the right; the pattern should be similar to what you're seeing on the left. This might not be evident when you're running your application on your laptop or in your local environment; in most cases, you'd start seeing this behavior once you've deployed to the cloud.
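Here's that metrics sketch: a minimal example of reading heap usage, garbage collection counts and pause times, and thread counts from the standard JMX MXBeans. This is not our actual collector, just a hedged illustration; in practice you'd run something like this on a timer and ship the values to your monitoring backend instead of printing them.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;

public class JvmMetricsSnapshot {
    public static void main(String[] args) {
        // Heap usage: used vs. committed is exactly what the sawtooth graphs plot.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("heap used=%dMB committed=%dMB max=%dMB%n",
                heap.getUsed() >> 20, heap.getCommitted() >> 20, heap.getMax() >> 20);

        // GC counts and accumulated collection time, per collector (young and old).
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("gc %s count=%d time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }

        // Live vs. peak thread counts, for the thread-usage pattern.
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        System.out.printf("threads live=%d daemon=%d peak=%d%n",
                threads.getThreadCount(), threads.getDaemonThreadCount(),
                threads.getPeakThreadCount());
    }
}
```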
Now, as I said, we need to look at the frequency of garbage collection, especially full garbage collection, and we also need to be very conscious of how long a full garbage collection takes. One of the reasons I say this is that it also depends on how you size your heap when you deploy the application. You don't want your heap to be too large, and it shouldn't be too small either. What happens when it's too large? Of course, you'd be able to space your garbage collections out at a much lower frequency, but at the same time it would take much longer to collect all the objects when a full garbage collection does kick in. So you need to find a balance between how large your heap is going to be and how often you want garbage collection to kick in, and make sure you're conscious of the time a full garbage collection takes. You also need to look at how many objects are being collected in each collection, and you want this to be very consistent across the time interval you're looking at. Say you're looking at your service for the past 24 hours: what does the garbage collection look like? If we have time at the end of the session, I'll show you a live graph of how garbage collection looks over a period of 24 hours or three days, so you get an idea. But be conscious of how many objects are being collected as well.

Database connection pool management. This is another key factor you want to be conscious about, because if your application is using Spring Boot, the Spring auto-reconfiguration, whether you've noticed it or not, automatically reconfigures your max active connections to four. I'm not sure why it does that, but this has been a very painful problem for us, and it happens irrespective of whether what you push to the cloud is a production application or a development application. So one of the things we had to do in our Spring Boot applications is override this auto-reconfiguration and set the max active value ourselves. Please watch out for this; we have been in situations where we ran out of DB connections for no apparent reason.

Another thing to look out for: size the database connection pools of your Java applications based on the number of instances you're going to run. Your database itself has a fixed number of client connections it can accept; typically, with Postgres or MySQL, you probably have about 100, maybe 500 or 1,000 connections at the most. Say you size your pool at 25 connections per instance. When you start 10 instances, you're looking at 250 connections. At 20 instances you have 500, and at 40 instances you're maxing out the number of connections the database can accept. So when you size the connection pool, you need to be aware of what the max and min connections are. The minimum connections are the ones your application actually starts with. Typically, we set max connections around 25 to 50 and min connections around five at the most, because your application shouldn't be using a lot of database connections: if it's a transactional application, it probably uses one or two connections at any given time. If it actually uses more than that, there's a problem you need to look at.
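To make the override concrete, here's a minimal sketch of defining the pool yourself in a Spring Boot app, which keeps the platform's auto-reconfigured max active of four from applying. I'm using HikariCP purely as an illustration (swap in the Tomcat JDBC pool or whatever your app actually ships with), and the DB_URL, DB_USER, and DB_PASSWORD environment variables are hypothetical stand-ins for however your service binding is wired in:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import javax.sql.DataSource;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class DataSourceConfig {

    // Defining the DataSource bean ourselves means the auto-reconfiguration
    // no longer silently decides the pool size for us.
    @Bean
    public DataSource dataSource() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl(System.getenv("DB_URL"));    // hypothetical env vars;
        config.setUsername(System.getenv("DB_USER"));  // wire in your real binding
        config.setPassword(System.getenv("DB_PASSWORD"));

        // Per-instance sizing: max around 25-50, min around 5, so that
        // (max pool size x instance count) stays under the DB's client limit.
        config.setMaximumPoolSize(25);
        config.setMinimumIdle(5);
        return new HikariDataSource(config);
    }
}
```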
So what kinds of problems come out as you actively monitor idle and active connections? One thing to watch out for is long-running queries, because the reason database connections get held for a longer period of time is that your queries are running slowly, and as more and more throughput comes into your application, it's going to look for connections and hold on to them for much longer. So watch out for those, tune them, and make sure your typical transactional query doesn't take more than a few milliseconds; that's typically what you want to see. Anything that takes more than 100 milliseconds, I would watch and see whether I can tune it. There might be situations where your queries have to run longer, and that's perfectly okay. What I want to emphasize is that you should size your database connection pool based on how your queries perform, and make sure the long-running queries are tuned.

The next thing to look out for is thread pool management. This is specifically about the Java threads in the container you deploy to, whether that's Tomcat or Jetty. You need to understand the default number of threads your container comes up with, how many of them are active, and how many are idle. Typically, if your application deploys on a Tomcat container, you'd see about 200 threads by default; unless you go ahead and tune it, 200 is the default, and that's perfectly fine. What you need to watch out for, again, is how many threads are active and how many are idle. You want as few threads as possible to be active at any given time, because your threads should really be short-lived: a thread should not be held for a long time, especially when it's running transactions. Having a large number of idle threads is perfectly okay; it just takes up memory.

One thing to consider with idle threads, which I've not mentioned here, is the stack size. There's an -Xss parameter that typically gets set when your Java application is pushed, and the stack is allocated per thread. So be conscious of what the stack size is for each thread, and tune it based on the number of threads your application has.

The other thing I would look at, again using your monitoring tool, is your thread usage pattern. Looking at the past 24 or 48 hours: how are your threads being used? How many are active? When does the count go up, and when does it come down? Does it ever require more than the default 200 threads? These are the thread usage patterns you want to watch if you want to optimize your application running in the cloud.

Now, one thing we have also come across is stuck threads, and I'm sure many of you have come across this as well. A lot of these monitoring tools give you options to take thread dumps, and if you're using the Spring Boot Actuator, it also gives you endpoints where you can take thread dumps and heap dumps. So make use of them; there are tools out there that help you look into your application and tell you how it's performing.
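If you're on the Spring Boot Actuator, its thread dump endpoint covers this; if not, here's a minimal hand-rolled sketch using the platform ThreadMXBean. Taking two of these a few minutes apart and comparing the stacks is exactly the "are the threads moving?" check I'll describe next:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpUtil {

    // Print a full thread dump: state, held locks, and stack trace per thread.
    public static void dump() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            System.out.print(info);  // ThreadInfo.toString() includes the stack
        }
        // Deadlocked threads are the clearest form of "stuck".
        long[] deadlocked = mx.findDeadlockedThreads();
        if (deadlocked != null) {
            System.out.println("DEADLOCK involving " + deadlocked.length + " threads");
        }
    }

    public static void main(String[] args) {
        dump();
    }
}
```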
So I would watch out for stuck threads, and especially when you're looking at thread dumps, you probably want to take multiple dumps at, say, a five-minute or a one-minute interval and see how the threads are moving: are they actually processing something, or are they just stuck in one place? That gives you an idea of how the threads are moving around and how work is being processed.

The other thing is container resource management. Where this comes up is that when developers push their applications to Cloud Foundry, they're mostly not very concerned with how the container's resources are being used; that mostly lands on the operators, or on performance engineers looking at the performance of the application itself. Some of the key things we should look at when running an app in production: you want to monitor the CPU and memory usage of the container. Typically, when you run cf app and list out all the instances, you can see the numbers there, or most monitoring tools will give you CPU and memory utilization as part of the JMX metrics. Note that JMX metrics will probably give you just the heap usage; they don't have access to memory outside the heap, so they may not show you the actual container memory. In some cases, you may want to write custom scripts that pull values from the CF APIs, which give you metrics for the containers themselves.

We've seen situations with older buildpacks, especially buildpack 3.x rather than 4.x, where applications, especially Spring Boot applications, sometimes require more native memory than the buildpack allocates. Say you deploy your application with a one-gigabyte container. The buildpack typically allocates 70% to the heap and, I think, five or ten percent to native memory. If your native memory needs more than that ten percent, then unless there's enough memory left in the container to allocate, you're eventually going to hit out-of-memory errors. So what we ended up doing in the past was adjusting the memory settings of our application to reduce the share given to the heap from the default 70% to less than 50%, giving more memory to native allocations. You may have to play around with memory a little in some cases. It may not be true for all applications, but watch out for it, is what I'm trying to say. I'll show a small sketch of checking the JVM-visible native regions at the end of this part.

Implement autoscaling. Autoscaling has been a big, big help for us in many, many instances. I think there was recently an incubator project for the autoscaler that came out as open source, so please look into it and make use of it; it really helped us a lot, and if we have some time I can show you exactly where it helped us. Again, you probably want to look at how memory management is done by the Java buildpack; they've made a lot of improvements to the memory calculator in the recent versions of the buildpack, 4.8 and 4.9.
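Here's that native-memory sketch: a hedged way to get at least partial visibility from inside the JVM. The non-heap regions (metaspace, code cache) and the direct and mapped buffer pools are the JVM-managed pieces that live outside the heap; this still won't account for thread stacks, JNI allocations, or allocator overhead, so for the full picture you need the container metrics from the CF APIs mentioned above.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

public class NativeMemoryCheck {
    public static void main(String[] args) {
        // Non-heap covers metaspace, code cache, etc.: the regions the
        // buildpack's small native allowance has to absorb.
        long nonHeapUsed = ManagementFactory.getMemoryMXBean()
                .getNonHeapMemoryUsage().getUsed();
        System.out.printf("non-heap used=%dMB%n", nonHeapUsed >> 20);

        // Direct and mapped buffers live outside the heap entirely and are
        // a common source of "invisible" native memory growth.
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("buffer pool %s used=%dMB count=%d%n",
                    pool.getName(), pool.getMemoryUsed() >> 20, pool.getCount());
        }
    }
}
```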
So understand how your memory is calculated and size your container memory accordingly. It took us a lot of iterations to get to a place where we can run our applications without encountering any out-of-memory exceptions, so understanding the way memory is calculated helps a lot.

Heap dumps. Again, I learned yesterday that the Actuator actually helps you here: it has endpoints to take heap dumps. We are not using that at this time, I wish we were; the way we do heap dumps is to log into the container, get the heap dump, and take help from the operators to do it. It is not a very straightforward process for us at this time, but we're trying to streamline it. But a heap dump helps a lot in understanding situations where you see memory leaks, like the graph on the right that we showed you, where the memory looked like it was leaking. In situations like that, you can take a heap dump, use memory analyzer tools, and look at the leak suspects. This has helped us a lot. So make use of heap dumps, and make use of the Spring Boot Actuator, which gives you a lot of capabilities for taking heap dumps, or thread dumps for that matter. (I'll show a minimal in-process sketch after the first case study.) I'm rushing because I've got a lot of content to cover, so apologies, and stop me if I'm going too fast.

One thing that really helps, which developers may not be aware of, is doing some capacity planning before you deploy an application. As I said before, size your container: run some non-functional tests and size the container based on the heap usage. You want to make sure you're not over-utilizing memory and running into out-of-memory exceptions. Then, the number of instances: first run tests that identify how many transactions one instance of your app can serve, and then scale based on the number of transactions your service is going to expect. Let's assume one instance of your service can sustain a throughput of 1,000 requests per minute. If you're expecting 10,000 requests, I would size it for at least 12 instances, just to give you an example with a little headroom. The only way to find out is to understand how much throughput a single instance can withstand, and then scale accordingly.

At GE Digital, we run three types of tests, essentially: capacity, scalability, and endurance. Capacity is to understand how much traffic a single instance of your app can serve. Based on that, we optimize the single instance and then run a scalability test to see whether it actually scales out for the number of instances, or the throughput, that we want to serve. Endurance is a very long-running test, say for a week or ten days, to catch memory leaks, resource leaks, and so on. These tests really help us make sure the Java application we're deploying to production is well tested on all fronts.

Understand your failure points. It helps to understand how much your application can withstand and how it recovers; that's the key part. Every application at some point is going to face a situation where it's going to crash.
Now, you need to be able to understand, when it crashes, how it recovers: how gracefully, how long it takes, and what kinds of errors you can foresee. Testing these things will actually help you understand where the problems are when you get paged in the middle of the night. This has really helped us get to a point where the application fails and then recovers when the load comes down.

We've also done some Chaos Monkey testing, a term coined by Netflix, where we randomly bring down instances to understand how the application behaves. If time permits, I would highly recommend doing this kind of testing for any Java application, and for that matter, any application. Cloud Foundry as a platform gives you the capability to automatically bring your instances back up, so when you try to bring down or crash instances, the health check actually brings them back up for us. But nevertheless, in some cases it does help to bring down your database or Redis cache or something to that effect, to see how your application behaves.

Let's look at some case studies. I think we have about eight minutes left. In this one, you can see in the first part of the graph that the service was running fine; this is actually a graph of about three days. On the left is the entire heap; on the right, just the old generation of the heap. Then, all of a sudden, you see the committed heap start going down, which means the objects in the heap are not getting collected. Ideally, in my experience, when your used heap and your committed heap converge, that's where you see a lot of out-of-memory exceptions. But in this case, the application kept running fine; for another two days it was still continuing to run fine, I just graphed three days of it. In spite of the fact that the committed heap and used heap are almost the same, which means there's very little memory available for the heap to allocate to new objects in the old generation, you can still see it's running fine and doing garbage collection.

What you need to understand is that the way the JVM behaves in certain instances, there are situations where it might look like there's a leak, but it may not be a leak at all. The reason we say it's not a leak is that even though the committed heap and used heap are exactly the same, or very close to each other, it still doesn't throw an out-of-memory exception; it runs fine. But the side effect is that you see a lot more full garbage collections, which means a lot more resources, more CPU, being used by the application. It's just the behavior of the JVM; in this case, I think it was JDK 1.8, update 101, I believe. Taking a heap dump helps here. We took some heap dumps in this case and analyzed them, and we couldn't find any leak suspects, so it probably is not a leak; it's just the way the JVM, or the way the application, behaves. As I said, there was no leak suspect in this case. And the older buildpack was working fine, in the sense that the older buildpack did not exhibit this behavior, so we're still trying to find out what is new in the 4.x version of the buildpack.
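Since heap dumps have come up a couple of times now, here's the minimal in-process sketch I promised. The Spring Boot Actuator's heap dump endpoint gives you this for free; if you're not on the Actuator, the HotSpot diagnostic MXBean (HotSpot JVMs only, and the /tmp path here is just illustrative) can write an hprof file that memory analyzer tools can run a leak-suspects report against:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {

    // Write an hprof heap dump that memory analyzer tools can open.
    public static void dump(String path) throws Exception {
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        // live=true dumps only reachable objects, forcing a full GC first.
        bean.dumpHeap(path, true);
    }

    public static void main(String[] args) throws Exception {
        // The container disk is ephemeral, so copy the file off (for example
        // via cf ssh or your operators) before the instance restarts.
        dump("/tmp/app-heap.hprof");
    }
}
```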
This is another very peculiar situation. We were running a service with multiple instances of it: on the left, you see instance one; on the right, instance zero. On the left is the very nice graph that you typically want to see. On the right, we're again seeing a situation where the committed memory is going down. It's the exact same service, same code base, except it's running as two separate instances of the app. Now, I don't have an answer for this one, so I was hoping one of you could give me some ideas as to what might be causing it. Excuse me. We took some heap dumps; there were no leak suspects. We're still scratching our heads trying to figure out why one instance might behave like this and the other might not. What I wanted to show is that you will encounter situations like this when you run cloud-native Java applications. One of the ways we got around it was to simply restart this particular instance, and everything was back to normal. We still couldn't figure out what was causing the problem, but watch out for this; it's interesting, and you might come across it too.

The last case study I wanted to show: we ran an endurance test for three days, and we observed a very small memory leak in all the instances. The heap dump was not very helpful in identifying a leak suspect. What ended up happening was that one of our external dependencies was leaking memory: it ran as part of the JVM process, doing bytecode analysis essentially, and that was leaking memory in some cases. What I'm trying to tell you with this one is that it's possible the code you've written is not the problem, because we all rely on external dependencies when deploying, and those might be the cause too. One way to troubleshoot is, if you can't find a leak suspect in your code, try removing the external dependencies and see if that resolves your problem. In maybe four out of ten situations, that might be the cause of the problem as well.

So that's all I have, and I think we have a couple more minutes. I'm open for questions, or I can show you some of the autoscaling that I was talking about before. While I'm showing it, please feel free to ask questions. Yes? Yeah. Yes. One of the tools we looked at was Sensu, or Graphite and Grafana as well, using Elasticsearch, and we did use that as a backup; it works fine, I'm not saying no. We used Elasticsearch for quite a while. You can actually write your own HTTP request, get the JVM metrics, put them in Elasticsearch, and graph them through Grafana or Kibana. It works fine. And yeah, it works fine. Exactly: these SaaS tools give you a little bit more; you pay money, you get a little more functionality out of it. That's pretty much it. But yes, you can, yeah, sure.
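On the "write your own HTTP request and put the metrics in Elasticsearch" point, here's a minimal sketch of one sample being shipped. The ES_URL environment variable and the jvm-metrics index name are hypothetical; a real collector would run on a schedule, batch documents, and handle failures rather than doing a single blocking POST:

```java
import java.io.OutputStream;
import java.lang.management.ManagementFactory;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.time.Instant;

public class MetricsShipper {
    public static void main(String[] args) throws Exception {
        // Grab a couple of JVM metrics via JMX.
        long heapUsed = ManagementFactory.getMemoryMXBean()
                .getHeapMemoryUsage().getUsed();
        int threadCount = ManagementFactory.getThreadMXBean().getThreadCount();

        // Index the sample as a JSON document; Grafana or Kibana can graph it.
        String doc = String.format(
                "{\"@timestamp\":\"%s\",\"heap_used\":%d,\"thread_count\":%d}",
                Instant.now(), heapUsed, threadCount);

        // ES_URL is a hypothetical endpoint for your Elasticsearch cluster.
        URL url = new URL(System.getenv("ES_URL") + "/jvm-metrics/_doc");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(doc.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("indexed, HTTP " + conn.getResponseCode());
    }
}
```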
So this shows the autoscaler service we're using, and you can see it scales up and down. I think we used CPU as our metric to determine autoscaling, between 100 and 300%: every time it hits 300%, it scales up another instance, and when it comes below 100%, it scales down an instance. In our case, it can go up to 800% because it was an eight-core instance, I believe. So yeah, autoscaling works really well for us; please make use of it.

The other thing I wanted to show you is a live application that's actually running right now. You can see the last 24 hours, and it works just fine. You can see the heap memory usage and the old gen, and if you look at the garbage collection, it's very nice: the full garbage collections are very spaced out, I think every eight hours, which is very nice. This is typically how your application should be running. And you can look at the throughput for the application, which is actually quite high, probably around 10,000 to 12,000 requests per minute, I believe.

So the service, yes, please? Okay, exactly. One of the things I would look at is, of course, taking a heap dump to figure out what objects in the heap are actually making it grow bigger and bigger. You're saying the amount of time it takes grows larger and larger, or the amount of heap? Oh, I see, okay. So that means it's taking longer and longer to complete the garbage collection. But do you also see that the amount of heap stays the same in that case? Okay. Oh, I see, okay. Let's talk offline; I'm running out of time, so last question, yes? Yes, which setting is that? Okay, oh, I see. I have not used that feature, unfortunately, sorry. Oh, I see, okay. I have not come across that, unfortunately, in my situation; I may have to look into it. Oh, I see, interesting. Oh, interesting, okay. Thank you, I'll take a look at it, thanks. Appreciate it. I think I'm out of time here. Thanks for coming, and I'll be outside if you have any questions. Thank you.