How many of you are developing Java applications? Not many. OK, not many. And I guess most of you are familiar with ECS and SQS. I see people nodding, so OK. Just a little bit of context, so I know what to cover. Cool. So I'll start. Thank you very much for coming. My name is Daniela. Today I'm going to talk about how to scale applications on Kubernetes and EKS. I work for a company called Learn Kubernetes, and I sometimes write Java as well — probably the wrong kind of Java. So why the wrong kind of Java, and why am I here to talk about scaling web apps? Because we all have something in common: whether you're writing Java, JavaScript, or .NET, we're all trying to build applications that can scale and that are robust. Take, for example, the Apple store. You can imagine that every year a lot of people, all at the same time, go to the website because they want to buy an iPhone. If you were to profile the traffic for that website at that particular time, you'd probably see a peak when people want to buy the iPhone, and then it slowly tails off. So how do you engineer applications that can sustain such a load without falling over? If I were to do that, my first idea would be: we need a website, so I'm going to build that. And I probably also need something to process the transactions — credit cards, or whatever orders come through this website. Easy enough, right? We know how to do that. So we can build a REST API that collects all the transactions, and then a number of clients. Those clients could be a website or different mobile devices — you could use your phone to buy items — and all of them are connected to this REST API. Now imagine you're building this website and you need to handle all of this traffic. What you probably want to do is plan for it. You think about your users and say: we need a front end, we need a back end. But a single instance of the back end or a single instance of the front end is not enough, right? We need a bit more than that. We probably need something like four back ends, and to distribute the traffic we need a load balancer. And we probably need the same for the front end as well. This is nothing new — we do this all the time. We create load balancers, we create more instances, we distribute the traffic, and we know that what we build can scale. Then, if you're serious about your infrastructure, before going live you usually also do performance testing: you stress the infrastructure and see how your application behaves. In this particular case, I took one front end and one back end and stressed my application to see how many requests per second it could handle. Through a series of iterations I found out that the front end can handle 1,000 requests per second — so 1,000 people can go on the website and everything is still fine — while the back end can only process 250 transactions per second. So there is a 1-to-4 ratio: where the front end can handle 1,000 requests, I need four back ends to support that kind of load. So that's my setup, right?
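As a rough sketch of that kind of performance test — assuming ApacheBench (`ab`) as the load tool and hypothetical hostnames and payload for the two tiers, since the talk doesn't show the actual commands:

```bash
# Stress the front end: 10,000 requests, 100 concurrent clients.
ab -n 10000 -c 100 http://frontend.example.com/

# Stress the back end's transaction endpoint with POSTs the same way.
# order.json is a hypothetical sample payload.
ab -n 10000 -c 100 -p order.json -T application/json \
   http://backend.example.com/transactions
```

If the first run reports roughly 1,000 requests per second and the second roughly 250, you've measured exactly the 1-to-4 ratio the rest of the talk relies on.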
So I'm ready for the day when everyone goes and buys my iPhone. Let's see what happens. It's that day, and the traffic is going up. We've got four back ends, four front ends, everything looks good — until it doesn't. We get so much traffic that the front ends can cope just fine with what they've got, but the back ends start to fall over as more traffic comes through. Remember, we have a 1-to-4 ratio, so we're putting too much stress on the back ends. And more back ends keep dropping connections because there is just too much load. So what do you do? What do we do? Just leave it like that? Maybe not. Well, we can log in to the AWS console and say we need four more back ends — we're manually scaling our infrastructure to cope with the extra requests. And that's great: we're distributing the load among these eight instances, and everything is back to normal. Or is it? By the time you've scaled your infrastructure, the traffic on your website has doubled. It took you some time to get these instances up and running, and by then it was too late. There is now so much traffic that not just the back ends but also the front ends are struggling with the requests coming through. What happens is you see back ends going up and down. Because a back end goes down and maybe gets restarted, its load is redistributed onto the other servers, and everything cascades. At the end of the day, all the back ends go down. Game over. It's a sad story, isn't it? But perhaps I'm making stuff up. Perhaps no one does this — you know how to do it, right? You won't make the same mistakes I did. Well, it turns out that even if you are Amazon, you can make that mistake. You're probably familiar with Alibaba's Singles' Day, where they have these massive sales. Amazon basically said: we can do that too — we call it Prime Day, and we put everything on sale. That's what they did last year, and this is what happened. They put everything up, there was a lot of merchandise on sale, and eventually the website went down. It crashed under the load of people trying to buy items, which is the same scenario we just described. But why? The post-mortem basically suggested that they didn't have enough servers to handle the traffic coming through their infrastructure — which is a little bit ironic — and they had to manually add servers to cope with it. Which makes me wonder: it's not just a problem for me as a small developer; it's a problem that big companies can have too. Now, this particular scenario was a little more complicated, as they had a very particular scalability issue with Oracle, but it gives you an idea of the kind of challenges you face when you build this sort of application. So my question to you is: is there anything we can do to avoid some of the problems we just described? What I'm trying to find out is: can I actually solve the problem of scaling? We saw that we had to manually log in to the AWS console and scale the number of back ends because we couldn't cope. So is there a way I can do it automatically?
Well, we know that with auto-scaling groups we can set rules to automatically increase the number of instances. So that sounds easy enough. But there's another challenge: when we talked about the front end and the back end, there was a ratio — we said 1 to 4. That means that if, for some reason, we get more traffic on the front end and need to scale the front end, scaling the front end alone is not enough. To cope with that traffic we need to spin up four more back ends for every new front end. The ratio needs to stay constant all the time, so we're basically synchronizing two tiers of our infrastructure. And let's say you later find a better way to write your code — you have a more performant Java application — and now you only need two back ends for every front end. Then you have to go back into your infrastructure and change the number: instead of growing to 12, you change it to 6. You always need to keep them in sync, which isn't great, right? The other thing I'd like to do, if I could: think about processing these transactions. A transaction comes in, goes through the front end, and is processed by the back end — imagine processing a credit card payment. But if the back end is unavailable, we've taken the credit card details and then the transaction just sits there; it's lost. What I'd rather do is pile up all these payment requests and process them later. So what if I could add a queue to my system? Instead of sending the messages directly to my back end, I start piling them up, and when a back end is ready, it consumes from this queue — you can imagine SQS — takes the items, and processes the transactions. And the idea is quite convenient: the two tiers now scale independently. I could have four front ends all pushing transactions into the queue and only one back end, very, very slowly, processing them one at a time. Or I could have four and four, or one and four. It's much better for me, because I don't need to keep the 1-to-4 or 1-to-2 ratio we were discussing earlier: these two pieces of infrastructure are now decoupled. The other interesting thing is that this gives me the benefit of queueing things up: if there is no one on the other side, I can still save the payment request and process it later. On the front-end side, publishing a transaction to the queue is only a few lines, as the sketch below shows.
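Here is a minimal sketch of that front-end side, assuming Spring Boot's `JmsTemplate` and a hypothetical queue name — this is illustrative, not the speaker's actual code:

```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.jms.core.JmsTemplate;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class PurchaseController {

    @Autowired
    private JmsTemplate jmsTemplate;

    // Instead of calling the back end directly, enqueue the transaction.
    // "transactions" is a hypothetical queue name.
    @PostMapping("/buy")
    public String buy(@RequestBody String order) {
        jmsTemplate.convertAndSend("transactions", order);
        return "queued";
    }
}
```

If the back end is down, the message simply waits in the queue — nothing is lost.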
So we talked about handling failures, we talked about messaging — what about the automatic scaling? You're probably familiar with Amazon ECS, which is a good example of a container orchestrator. But a new contender in this space is Kubernetes. So why Kubernetes, and why not ECS? The first reason is quite simple: with ECS, unfortunately, you get a bit of a black box. You know roughly how it works, but if there is a bug or something you need changed, you have to get in touch with Amazon — they own the platform, and they have to go and make the change. With Kubernetes, you can look at the code, you can change it, you can propose new features and get them added. The other reason is that Kubernetes is multi-cloud, which basically means you can install it on AWS, you can move it to Azure, but more importantly you can have the same setup on-prem as well — so you can run the same setup in both places and just migrate workloads from on-prem into the cloud. And it's designed to scale: any component of Kubernetes can fail and the cluster can still operate. So that's what we have. A recap: we can scale the tiers independently; we can handle failures, collecting these credit card payments and processing them later; and we know how to auto-scale — we could use ECS, we could use Kubernetes, but perhaps we're interested in exploring this new Kubernetes thing. Sounds like a plan. What about actually doing it? This could go horribly wrong, but I'm going to try it now. OK — so hopefully this is going to work. First of all, we need to build an application. I don't have the App Store, so I've got a much simpler store where you can buy items. That's it — you can buy as many items as you want. And I've got another section of the website where you can process transactions. To make it super, super easy, every transaction takes about five seconds to be processed. So if you've got three pending jobs — three transactions to process — that means 15 seconds on a single node. Let's have a look. This application uses Spring Boot, but you can imagine doing the same with any other programming language. First of all, we discussed a queue, so I need to start a queue. I'm starting the queue, and then, hopefully... OK. So I built the application, and I'm going to demonstrate the very first part of it, where I can buy items. I launch the queue, I have a front-end application, and I can buy items — 10, maybe more than 10. What happens now? I don't have the system that processes the transactions, so all of these requests are just piling up inside the queue. The reason I was fiddling with that environment variable earlier is that I'm a little bit lazy: the same application works as a back end and as a front end, and I've just got a toggle to change how it behaves. So I'm going to run it as just a back end — set the toggle to false — and restart it. Let's see what happens. You see that the 20 items we bought are now displayed on the screen, and if I refresh, the system is processing them, five seconds at a time. I can go back to the console, and hopefully you can see that the items have been processed — they're appearing as we speak. So that's the application. If you're into Spring Boot, this is the Spring Boot application, and all it does is this: every time you submit, we send some messages, and we use something called JMS to interface with the queue. If I scroll down, you can see the line that actually takes five seconds to process each item. It's a very, very simple application: we take messages off the queue, wait for five seconds, and move on. Very, very simple.
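The back-end half of that toggle boils down to a JMS listener. This is a minimal sketch, assuming Spring's `@JmsListener` and the same hypothetical "transactions" queue name as before — not the exact demo code:

```java
import org.springframework.jms.annotation.JmsListener;
import org.springframework.stereotype.Component;

@Component
public class TransactionProcessor {

    // Consume one message at a time from the queue.
    @JmsListener(destination = "transactions")
    public void process(String order) throws InterruptedException {
        // Simulate a slow payment provider: each transaction
        // takes five seconds, exactly as in the demo.
        Thread.sleep(5000);
        System.out.println("Processed: " + order);
    }
}
```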
So going back to our plan: we built the application, we know how it works, and now we want to deploy it. The interesting thing about Kubernetes is that it doesn't know how to deploy Java applications. It doesn't know how to deploy any particular application at all — it only knows how to deploy containers. The way to think about containers: you take your dependencies — the JVM, a number of files, everything you've got inside your app — and package it all into a single bundle, and that bundle is called a container. The container is language-independent, so you don't need to teach Kubernetes how to run Java; you just need to package your application so it can be run by Kubernetes. And the way we do that is fairly simple: we have a recipe for building the container, which basically describes what packages we need to install to make it work. That's called the Dockerfile, and that's what we use to build the container. Once we've described the steps, we can execute them: I just run docker build and that builds the container. I'm not going to do it now because it takes a little bit of time, but you get the idea. So once we have the container, what do we do? Are we finally ready to deploy to Kubernetes? I hope so, otherwise this talk is never going to end, right? Well, yes: we deploy to Kubernetes. We need to deploy the front end and the back end, we need to route traffic to them — so we know we need load balancers — and we also know we need a queue. So we're going to create the queue, create the front end and the back end, connect them together, and also set up auto-scaling on the back end. Now, in Kubernetes we like to call things by different names. "Load balancer" is too easy, so we call them Services. The applications — we don't call them applications, because that would be too obvious — we call them Pods. And the auto-scaling group is called the Horizontal Pod Autoscaler. So this is what we do. I prepared most of this in advance. Originally I had an EKS cluster, which is basically Kubernetes on Amazon — Amazon offers Kubernetes as a managed service. But there were a few interesting wrinkles around exposing services and running the autoscaler, and I didn't know if I'd have internet, so at the moment I'm running everything locally: I've got a cluster running locally on something called Minikube. I can use a command called kubectl to interact with the cluster, and it's telling me that everything is OK. What I have next: when I want to deploy something into Kubernetes, I describe what the end result should be. I don't say "deploy this"; I say "at the end of the deployment it should look like this — now you go and do it, I don't want to do it myself." So what we have here is a description. It may look like a lot of stuff, but you only need to remember a couple of lines: this one, which says "this is the queue you should deploy," and this one, which is the port exposed by the load balancer. If you're familiar with ECS, this is exactly the same kind of thing you would see in ECS as well.
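Condensed to its essentials, that description looks roughly like this — a minimal sketch, where the image name, labels, and port are assumptions standing in for the speaker's actual file (any JMS broker such as ActiveMQ would play the queue's role):

```yaml
# Deployment: run one copy of the queue.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: queue
spec:
  replicas: 1
  selector:
    matchLabels:
      app: queue
  template:
    metadata:
      labels:
        app: queue
    spec:
      containers:
        - name: queue
          image: webcenter/activemq   # hypothetical ActiveMQ image
          ports:
            - containerPort: 61616    # standard ActiveMQ JMS port
---
# Service: the "load balancer" that routes traffic to the queue pod.
apiVersion: v1
kind: Service
metadata:
  name: queue
spec:
  selector:
    app: queue
  ports:
    - port: 61616
```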
And then once I've got those descriptions, I can just use kubectl — kubectl is the command that sends these descriptions to the cluster. So I run kubectl create, pointing it at the queue file, and that sends the command. You can see it returns immediately; the cluster will eventually have the application up and running. OK, that's just the queue, so let's deploy something else — let's deploy a front end. The front end is very, very similar. Again, you can see a lot of stuff here, but the only lines we're interested in are this one, which is the image we want to deploy, and these, which set the same environment variables I was fiddling with earlier — plus the port we want to expose. That's it; everything else is a little bit of boilerplate. So I've got this file, and I can send it to the cluster like this. I can check if the application was deployed — you can see down there, there is a front end. Now, I want to actually see it, right? I know it's there, but I want to see the website. So I can use a convenient command called minikube service frontend, and I've got the browser here. This is my application deployed inside the cluster, inside Kubernetes, and I can buy items. Are these items processed? No — we don't have a back end yet. So how does the back end work? Similar stuff: the back end looks very much like the front end. We said we're interested in this line — what image I want to use to run the application — and this other one — where I want to expose it. So I'm going to do the same and deploy this application as well. Let me clear that... just spell it right... sorry... that should be it. Done. There we go. You can see I've got a lot of messages in the queue; it takes a little bit of time, but it's eventually processing them. And I can go back and add more, and more — a lot of them. So, do you want to wait for 300 items to be processed? Probably not. What can we do? We discussed earlier that we could log in to the AWS console, change the number of instances, job done. We can do the same here: there's a number in the description that basically defines how many copies of the back end we want. We can say three and apply that. And if I do a kubectl get, you can see it's running three copies of the back end. So if I go back, it should be a lot quicker — three times quicker than it was before. That's a very good improvement, but we discussed how manually changing things may not be the best strategy to fix this. So Kubernetes has something like the AWS auto-scaler, but just for applications: it's called the Horizontal Pod Autoscaler. And the horizontal pod autoscaler basically says: which application should scale? In this particular case I'm saying I want to scale the back end. Then: what's the minimum number of replicas, what's the maximum? And then you specify a metric. You could scale your application based on things like CPU or memory, but I'm not really interested in CPU or memory — I'm more interested in how full my queue is. Is there a lot of stuff in my queue?
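That autoscaler description is again just a few lines of YAML. A minimal sketch, assuming the older autoscaling/v2beta1 custom-metrics API that was current around the time of this talk, and a hypothetical metric name `messages`:

```yaml
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: backend
spec:
  # Which application should scale.
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  minReplicas: 1
  maxReplicas: 10
  metrics:
    # Not CPU or memory: a custom metric scraped from the app —
    # the number of pending messages in the queue.
    - type: Pods
      pods:
        metricName: messages        # hypothetical metric name
        targetAverageValue: 10      # grow while the queue holds more than 10
```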
So that's the goal — but how do I expose that number? Where do we get it from? What I have here is an extra URL in my application whose only job is to expose the total number of items in the queue, and I expose it at /metrics, which is the Prometheus sort-of-standard for exposing metrics. Then, in my cluster, I installed Prometheus, which is basically a monitoring tool and time-series database. What it does is go and check that number every 10 seconds and store it, so later on I can run queries on it. And what kind of query do I want? I want to make sure the number is always below 10; if it's more than 10, I want to grow the number of back ends. So the combination of exposing the number and having the autoscaler pointed at it makes the scaling happen automatically. Let's see if it works — hopefully it does. I can create the HPA, the horizontal pod autoscaler, the same way I created everything else. At the moment it says unknown, because it's still collecting the metrics, and hopefully it will give us a reasonable number. You can see there: the target is 262 — that's the number of messages currently in the queue — and the number of replicas is three, which is what we had at the beginning. And what's the number now? Kubernetes has realized that we've passed the threshold of 10 messages in the queue, so it's scaling out. How can we check that? Kubernetes has another command called describe, and describe's output has two parts: the first is called conditions and the other is events. You can see the event basically says: "hey, successful rescale — I noticed we passed the threshold, so I scaled to six." And the next line says we're still over 10, so it grows the number of back ends again — to 10. If I check the number of applications up and running, you'll see I've got 10 — effectively 10 applications consuming messages from this queue. What happens next is that eventually we'll finish processing all the transactions in the queue, Kubernetes will realize the number is below 10, and it will start draining pods. So in the same way you have the auto-scaler on something like ECS, we've got this auto-scaler for applications, where we can point at specific metrics inside the application itself. Cool.
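That /metrics endpoint I mentioned is only a handful of lines. Here's a minimal sketch in Spring Boot — the queue-browsing details are simplified, and the metric name `messages` is assumed to match the HPA sketch above; Prometheus's text format just wants a name and a value:

```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.jms.core.JmsTemplate;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class MetricsEndpoint {

    @Autowired
    private JmsTemplate jmsTemplate;

    // Expose the queue depth in Prometheus' plain-text format so it
    // can be scraped every 10 seconds and fed to the autoscaler.
    @GetMapping(value = "/metrics", produces = "text/plain")
    public String metrics() {
        // Count pending messages by browsing the queue without consuming.
        int pending = jmsTemplate.browse("transactions", (session, browser) -> {
            int count = 0;
            java.util.Enumeration<?> messages = browser.getEnumeration();
            while (messages.hasMoreElements()) {
                messages.nextElement();
                count++;
            }
            return count;
        });
        return "messages " + pending + "\n";
    }
}
```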
So I deployed everything locally, but you can imagine that the same commands, the same sort of specifications I showed you today, can be lifted and shifted into EKS, which is Amazon's managed Kubernetes service. Generally, the way it works for EKS is that Amazon provides you with three EC2 instances in different AZs, with an ALB on top that manages the requests coming through the API. Inside each of these nodes there's something called etcd, the Kubernetes database, which keeps everything synchronized and up to date. That's why, if you've looked into Kubernetes, you've probably noticed that GKE — Google Kubernetes Engine — and AKS — Azure Kubernetes Service — give you the control plane for free, while Amazon actually charges you 20 cents per hour. The reason is that you've got three EC2 instances and one ALB, and these things cost money, whereas Google and Azure use shared infrastructure for the master nodes — so for them it's a marginal cost, while for Amazon it's expensive. So that's basically all I wanted to say, except for a couple of lessons learned, at least for me. The interesting thing we did today: we had one node — but you can imagine having three nodes — and we started with one application, scaled manually to three, and then triggered the autoscaler. Eventually the autoscaler could fill up all of the space inside your servers, and I don't want to be back to the same problem we had before. It turns out you can combine the Horizontal Pod Autoscaler with something like the auto-scaling groups in AWS: when all of your EC2 nodes fill up, you can trigger the cluster autoscaler, get more nodes, and continue scaling up with the Horizontal Pod Autoscaler. So generally, in Kubernetes we've got three types of autoscaling: vertical autoscaling, horizontal autoscaling, and cluster autoscaling. The other interesting thing — I don't know if you've noticed — is that what Kubernetes does is basically deploy containers, and you can think about containers as blocks. Applications have memory and CPU constraints: some are more CPU-intensive, some are more memory-intensive. Think of the blocks in Tetris. What Kubernetes does is place these blocks inside your infrastructure, and the interesting thing is that it tries to pack these containers as closely as possible. And — I promise these are the last two slides — the other reason I really like Kubernetes, at least as a developer, is that all the commands you saw me typing on the screen are really just API calls to the Kubernetes master node. Which means you could write your own kubectl and do crazier things than just deploying stuff. That's it.

A question from the audience, about the Tetris thing: we actually use Kubernetes extensively and we also hit this placement behavior. Have you tried to manually tune or customize it, to keep Kubernetes playing Tetris more or less correctly?

I think the short answer is yes. A slightly longer answer is: it depends what kind of application you're building. Usually Kubernetes can only measure these blocks if you provide limits. You saw earlier that we had a description of what the deployment should look like; there's a section I didn't show that basically defines how much memory and how much CPU that process can use. Kubernetes looks at those numbers and plays Tetris with them. Now, what happens if you don't provide the numbers for requests and limits? Then Kubernetes looks at the block and it looks very, very thin — basically a dot. It places it on the first node, then the second one also goes on the first node, and eventually it treats these blocks as weightless: it keeps filling the node even when there is no space left. So it's really, really important that when you design your autoscaling, you also define the requests and the limits, and you test your application.
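That missing section is just a resources block on the container spec. A minimal sketch, with the numbers purely illustrative:

```yaml
# Inside the Deployment's container spec: this is what gives the
# "Tetris block" its shape. The values here are assumptions.
resources:
  requests:            # what the scheduler reserves when placing the pod
    memory: "256Mi"
    cpu: "250m"        # a quarter of a CPU core
  limits:              # hard ceiling the container may not exceed
    memory: "512Mi"
    cpu: "500m"
```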
That's why you usually combine the Vertical Pod Autoscaler with the Horizontal Pod Autoscaler: you start with very little memory, your application grows vertically, and once it has enough memory, that's when it scales horizontally.

Another question, about the metric I was exposing: the HPA scales based on a custom metric, right? How do you expose custom metrics to it?

That's a fairly big topic. Kubernetes comes with some built-in metrics: there's something called the metrics server, which exposes things such as CPU, memory, space on the nodes, and so on. On top of that, you can have what's called a custom metrics server, which is what I installed. The custom metrics server is able to ingest metrics that aren't just Kubernetes metrics but any metrics you want to expose. The combination of the two gives you what you want. In my case I didn't really care about CPU or memory or nodes, so I installed the custom metrics server, made sure it scraped all the metrics from my apps, and used those metrics from there. So the short answer is: you should install a custom metrics server.

So that was probably running in kube-system?

No, it doesn't come by default. If you're using EKS, it doesn't come by default — you need to install it yourself. If you're using AKS, on Azure — not to advertise — it comes by default. Thank you. Thanks.