Okay, I think we can start. Hi, everyone. My name is Sadi, and today I'll be talking about WebSockets and the scalability challenges that come with them. Let me start with an image from an incident we had some time ago. Yeah, it was a long Friday. We had to stay until 2 a.m., but we eventually fixed the issue. The alternative would have been to quit our jobs and move to another city. We didn't do that, fortunately. We took a lot of learnings from this incident, but most importantly, it was a wake-up call that we needed to improve the scalability and reliability of our system. You might be wondering what's happening here, but before I go into more details, let me say a couple of words about myself.

I'm Sadi, a software engineer and consultant at Netlite, a tech consulting company based in Europe. I focus mostly on cloud engineering, cloud architecture, and site reliability engineering, and more recently I've also gotten interested in developer experience topics. In my free time, I enjoy basketball, photography, and, ideally, Fridays without pager-duty calls.

In this talk, I'll first cover electric vehicle charging, just to set some context and basics. Then I'll jump into WebSockets, scalability challenges, and the concrete problems we faced in our platform. And finally, our approach and how we addressed these challenges.

So let's start with the first one: some context about EV charging. Let's do a quick show of hands: how many of you have driven an electric car already? That's quite a lot, more than half. That's nice. Then maybe you're already familiar with some of the frustrations of charging electric cars. Let's take a look at what that looks like. The EV driver plugs their car into a charging station. The charging station is connected to the electrical grid, but it also has an internet connection via a mobile carrier. It uses this internet connection to communicate with a system called a charging point operator. It communicates via OCPP messages; OCPP is just a protocol that standardizes what the messages and the message flow should look like. It's an open standard, and it makes it possible to operate stations from different vendors. The charging point operator is the system that companies interested in offering EV charging solutions build. In the last couple of years, I've been leading a platform team at one of our largest EV charging clients. The goal of this team is to provide an easy-to-use, reliable interface to the charging stations. We don't have any user-facing products ourselves; we offer this platform, and other teams build the actual business products on top of it. We offer features like station management, session handling, authorization, remote start and stop (for controlling your charging experience from a mobile phone, for example), firmware updates and management, and so on.

This is what the baseline architecture looks like. On the left-hand side, we have the platform services. Here we have a component we just call the WebSocket Gateway. It's the entry point for stations to connect to our system. It's limited in scope: it basically maintains the WebSocket connection, does some parsing and some rate limiting, but that's about it. Then we have the station management service, where most of the business logic is implemented: the OCPP handling, session handling, and most of the platform's business logic.
Then on the right, we have a few more services that are isolated in scope, like the firmware service, a history service for time-series data and analytics, diagnostics, and so on. All of this is exposed through an API to the product teams, either B2C or B2B. They use it to build things like home charging, for people who buy a charging station and install it in their homes, or dealer charging, for companies that provide charging at their facilities, and a lot of other products as well.

Now let me do a quick primer on WebSockets, so that even if you haven't used them before, you can still follow the rest of the talk. WebSocket is a protocol that enables low-latency, bidirectional communication between a client and a server. In our case, the clients are the charging stations and the server is the WebSocket Gateway component we saw earlier. It provides a persistent channel over a single TCP connection. Compared to HTTP, the benefit is that the client doesn't have to establish a new TCP connection every time it wants to send a message; the connection can be reused, and you get rid of a lot of the overhead. Of course, this has a price: the server now has to maintain the state of the connection. The connection is long-lived and stateful in nature. You can read more about it; I left a link on the slide. It has use cases in a lot of applications that need real-time communication, for example multiplayer games, collaborative editing, chats, stock tickers, and so on.
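Just to make that primer a bit more concrete, here is a minimal sketch of what a WebSocket endpoint can look like in Go, using the gorilla/websocket library. This is purely illustrative and not our gateway code; the route, buffer sizes, and echo behavior are made-up placeholders.

```go
package main

import (
	"log"
	"net/http"

	"github.com/gorilla/websocket"
)

// upgrader turns an ordinary HTTP request into a WebSocket connection.
var upgrader = websocket.Upgrader{
	ReadBufferSize:  1024,
	WriteBufferSize: 1024,
}

func wsHandler(w http.ResponseWriter, r *http.Request) {
	// The handshake starts as plain HTTP and is then upgraded;
	// from here on the TCP connection stays open and is reused.
	conn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		log.Println("upgrade failed:", err)
		return
	}
	defer conn.Close()

	// One long-lived, bidirectional channel: read a frame, write a frame.
	for {
		msgType, msg, err := conn.ReadMessage()
		if err != nil {
			return // client went away or the connection broke
		}
		// A real gateway would parse OCPP here; this sketch just echoes.
		if err := conn.WriteMessage(msgType, msg); err != nil {
			return
		}
	}
}

func main() {
	http.HandleFunc("/ws", wsHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The important part for the rest of the talk is the loop: the handler holds onto the connection for as long as the station stays connected, which is exactly the state the server has to maintain.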
That brings us to the trouble with scaling out WebSockets. We'll explore why it's different from scaling an HTTP web server and what exactly the challenge is.

First off, let's start with load balancing and horizontal autoscaling. Consider the situation on the left: we have a network load balancer with a standard round-robin strategy, and let's say the WebSocket Gateway has four pods; it's a horizontally scaled application. This would be an equilibrium state: each pod is maintaining 8,000 stations, and everything is balanced. But if one of the pods restarts for whatever reason, or even if the HPA decides to scale up or down, that can change pretty easily. Here, for example, pod number four was restarted. Its stations disconnect and then reconnect, but the load balancer simply redistributes them, and we end up with the situation on the right, where some pods have more connections and some have fewer. Of course, over time stations reconnect and things drift back towards equilibrium, but an indefinite amount of time is needed for that to happen. During this time there can be instabilities: some pods are serving more connections, they can run out of memory and crash, and clients may experience high tail latencies on those pods. So the system can end up in this out-of-equilibrium state, which is problematic.

The second problem has to do with something called a reconnection storm. I mentioned earlier that stations are connected to the internet via mobile carriers. Consider a situation where some stations on the left here are connected via carrier number one, and some others, say 10,000 stations, via carrier number two. If carrier number one has an issue, an outage, maybe a regional outage, all of those stations lose their connection at once. And then they all try to reconnect at once as well, as soon as the carrier issue is resolved. What this causes is large, unpredictable load spikes; it's a bit like an accidental denial-of-service attack. What would usually help here is retrying with exponential backoff. However, we don't really have control over this, because it's the vendors who implement the station firmware. Ideally they would implement retries with exponential backoff, but in practice we rarely see them do it; usually the stations just reconnect immediately after they are disconnected, and that's problematic. Triggers can be network carrier outages, as I described, but also much more benign things, like a release of the WebSocket Gateway or a pod being rescheduled. This can create a load spike, which may later lead to positive feedback loops.

Let's look at the implications a little. The system is always at risk of entering an unstable state; there is always a risk of these positive feedback loops. Say a large number of stations try to reconnect at once. Maybe some pods crash, causing more stations to lose their connections, and those stations immediately try to connect again. This can turn into a cycle, a positive feedback loop, that the system cannot recover from by itself. Another implication is load spike propagation: even if the platform itself can handle the load, the downstream services might not be able to handle these spikes. This is an issue that plagued our system as well. And another smaller, less serious thing: during releases of the WebSocket Gateway component, we know there is user impact, a connection disruption. It doesn't make the stations completely unusable, but it does make the user experience worse. Since we knew this, we introduced a maintenance window rule, and every time we needed to make a release there, we had to announce it. That's quite an annoying thing to do, and it also goes against our continuous delivery practices.

And back to the incident we saw earlier: it was also caused by one of these reconnection spikes. It was really a combination of factors that ended in this incident: the size of the database, the target memory and CPU utilization of the pods, and the load at that time. But the root cause was something else: a misconfiguration in the logger. We were logging to a logging API that couldn't accept all the requests, which caused back pressure; the logger batched up log items for longer than it should have, and the pods were eventually killed because they ran out of memory. This wasn't introduced in a release or anything; the logger had been like that for a long time. What really triggered it was this combination of factors, the nature of the station reconnections, and this reconnection storm behavior. So we knew something had to be done: we needed to improve the reliability and scalability of our system. I really like this quote from Google's SRE book, that reliability is the most important feature of any system, and it's certainly true in our case. Reliability is really the core value proposition that we offer to the product teams. So we set out to tackle these challenges; let's see how we did that.
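To wrap up the problem side with something concrete before moving on to the solutions: the retry-with-exponential-backoff behavior we would like the station firmware to implement could look roughly like the sketch below. This is a hypothetical client-side example in Go using the gorilla/websocket dialer; the backoff bounds and the jitter strategy are illustrative assumptions, not something the vendors (or we) actually ship.

```go
package main

import (
	"log"
	"math/rand"
	"time"

	"github.com/gorilla/websocket"
)

// connectWithBackoff keeps dialing the server, doubling the wait after each
// failure and adding jitter so that thousands of stations that lost their
// carrier at the same moment do not all come back in the same second.
func connectWithBackoff(url string) *websocket.Conn {
	backoff := 1 * time.Second
	const maxBackoff = 5 * time.Minute

	for {
		conn, _, err := websocket.DefaultDialer.Dial(url, nil)
		if err == nil {
			return conn
		}
		// Full jitter: sleep a random duration in [0, backoff).
		sleep := time.Duration(rand.Int63n(int64(backoff)))
		log.Printf("dial failed (%v), retrying in %v", err, sleep)
		time.Sleep(sleep)

		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

func main() {
	// Hypothetical endpoint; a real station would use its operator's OCPP URL.
	conn := connectWithBackoff("wss://example.com/ocpp/STATION-42")
	defer conn.Close()
}
```

With something like this on the station side, a carrier outage would turn into a spread-out trickle of reconnections instead of a spike; since we can't rely on that, the rest of the talk is about what we did on the server side.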
So how did we address these challenges? Let's start by looking at the connection flow again. I've listed the steps here, but I won't go through them; it's more for completeness, really. The takeaway is this: in the WebSocket Gateway component, which is horizontally scaled, as is the station management service, we terminate TLS, do some parsing and some rate limiting, and then the authentication of the station happens in the station management service. During this process, we save a few things about the connection in a Postgres database. And during times of high load, this database can become a bottleneck. That's the main thing I wanted to show from this connection flow: there are a couple of writes and a couple of reads against the database, and during peak load this can become a bottleneck.

Some quick remedies we could apply right away: first, over-provision the database. Just throw some money at the problem and hope it goes away. It's wasteful and costly, of course, but as a temporary measure it's pretty helpful. Another thing, a bit more subtle, is to adjust the horizontal pod autoscaler's stabilization window. We have a couple of metrics that we use for scaling up and down, and we don't want to react to every single fluctuation in them; because of this reconnection behavior, we want to avoid that as much as possible. Making the stabilization window larger smooths out the HPA's reactions and helps avoid these fluctuations in the replica count.

That's good enough for quick remedies, but we also set some goals for our long-term solutions. Primarily, we wanted a more lightweight connection flow that we could scale more easily and that would alleviate the database-related bottlenecks; as we saw, we had to over-provision the database by quite a lot. We also wanted to decouple the downstream services, to stop the propagation of load spikes I described earlier. And we wanted to reduce the disruption during releases: to find a way to hopefully remove the maintenance window rule and have as little impact on the user experience as possible when we release the WebSocket Gateway component. Those are the goals we set for ourselves, and now we'll go into the more concrete approaches we took to achieve them.

First, I'll discuss optimizing the WebSocket connection flow. Some observations, and we saw this earlier, so this is not new: the key observation is that the connectivity data we stored in Postgres is ephemeral in nature. That means we don't really need the durability guarantees of Postgres to store it, and if we could avoid storing it in Postgres, that would be great, because then in times of high load we don't need to reach out to the database and we avoid that bottleneck. So what we did is introduce a service we call the connectivity service. It encapsulates the connection logic and the connectivity domain, and it modifies the connection flow a little so that we no longer need to store that data in Postgres; we can keep it in an in-memory database instead. It boils down to memory being faster than disk.
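As a rough illustration of that idea, here is a minimal sketch of keeping connection state in an in-memory store, using the go-redis client. The key layout, values, and TTL are invented for the example and are not our actual schema; the point is that entries are cheap to write, cheap to read, and simply expire if nobody cleans them up.

```go
package main

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

type ConnectionStore struct {
	rdb *redis.Client
}

// MarkConnected records which gateway pod currently holds the station's
// WebSocket. The TTL acts as a safety net: if nothing refreshes the entry,
// it simply disappears instead of lingering as stale state.
func (s *ConnectionStore) MarkConnected(ctx context.Context, stationID, podID string) error {
	return s.rdb.Set(ctx, "conn:"+stationID, podID, 15*time.Minute).Err()
}

// MarkDisconnected removes the entry when the socket closes cleanly.
func (s *ConnectionStore) MarkDisconnected(ctx context.Context, stationID string) error {
	return s.rdb.Del(ctx, "conn:"+stationID).Err()
}

// LookupPod answers "which pod do I route this command to?" without ever
// touching the relational database.
func (s *ConnectionStore) LookupPod(ctx context.Context, stationID string) (string, error) {
	return s.rdb.Get(ctx, "conn:"+stationID).Result()
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	store := &ConnectionStore{rdb: rdb}
	_ = store.MarkConnected(context.Background(), "station-42", "ws-gateway-0")
}
```

If the in-memory store is lost, the worst case is that these entries get rebuilt as stations reconnect, which is exactly the property described next.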
And for our case this is a great fit, because even if something goes wrong with the in-memory database, say the server crashes, we can easily recreate that data by having the stations reconnect, without even going into the backup mechanisms for Redis and so on. So even if it crashes, we can easily recreate the data and have it back in Redis. Ephemeral data is a great fit for an in-memory database. What we saw is that we were able to reduce the connection establishment time by an order of magnitude, and we reduced the load on the Postgres database, which means we can now scale it down to a much more reasonable size. We no longer need the database to be able to absorb the peak of these traffic spikes, which is at least an order of magnitude larger than the normal load on the database. We were also able to decouple the connection flow from other functionality. Since we were revising the flow anyway, we used the chance to refactor quite a few things; it was intertwined with other flows, like the certificate renewal flow, and we did some decoupling there as well. All in all, we were pretty happy with this. It was a long migration process, it took some effort and had a lot of subtleties, but it was really worth it in the end.

Still, another problem remained: during a release we still had a lot of disruption and user impact. Let's see how we tackled that. The root of the problem was rolling updates in Kubernetes. The idea with rolling updates is to incrementally replace the current pods with newer pods, and maxSurge and maxUnavailable let you control how fast this happens; there are some other settings as well, but these are the main ones. Rolling updates are great in most cases: they give you zero-downtime deployments without you basically having to change anything, and they're a perfect fit for most applications. However, they are a bad fit for applications that are sensitive to disruptions, like the WebSocket Gateway, where we have these long-lived connections that we don't want to disrupt. Let's see why.

Take a look at this scenario. We have a station connected and, let's say, three replicas of version one, and we want to do an update. We start by killing one pod of version one and scaling up a version two pod. At that point the station loses its connection, but it then connects to another pod and establishes it again. We repeat this process a couple of times until all of the pods are on version two. You see, even in this trivial example with just three pods and one station, the station lost its connection three times and had to reconnect. That causes a bad experience on the user side and quite considerable impact, so ideally we would avoid it. And you can imagine what happens when there are many stations and many more pods. This is also a reason why the traffic spikes are amplified: stations are reconnecting multiple times during a release with rolling updates. A way to tackle this is to switch to blue-green deployments. Let's do a quick overview of how that works. We would have a replica set of three pods on version one.
Then the idea is to switch the traffic all at once. First we bring up version two with the same replica count. We can run some checks before promoting it and look at some metrics, and then we switch the stations' traffic all at once. Ideally the stations disconnect only once, so there is only one reconnection in total when the update happens. Eventually we remove the version one pods, and the release is finished. We used Argo Rollouts to do this (I left the link there); it has a lot more features than we use, and it's definitely recommended if you're looking to do blue-green deployments. The plot on the left-hand side shows what I was describing: on the y-axis we have the total number of connections, and on the x-axis we have time. You can see that it really takes a while to recover to the point where all stations are reconnected. Compared with that, the blue-green approach reduces this window of disruption by a lot, by around 80% on average according to some measurements we did, simply by switching the deployment strategy. So this is really an example of a low-effort investment that pays off; the return on investment was quite large for us, and it really helped us improve the reliability of the system and reduce the disruption during releases. That's the second thing we did.

Now let's take a look at the load spike propagation topic I discussed earlier. If you remember, I was talking about these load spikes that propagate to the downstream services: even if the platform can handle the traffic, it's still problematic if the downstream services have to handle it as well. So what we did is switch to an event-driven architecture. How many of you are familiar with event-driven architecture, have you used it before? That's nice, quite a lot. I think everybody means something slightly different by event-driven architecture, but a definition most people would agree on is that it's an architectural pattern for building services that communicate with each other asynchronously, using events. An event is both a fact and a trigger, and it's expressed as a notification. Usually you can already tell by the name; it's always in past tense, like "station connected", "station disconnected", or "charging session started". Producers publish these events when something happens in their part of the domain and put them on a queue, where consumers can consume them and trigger some logic of their own, or simply ignore them. This topic would need a talk of its own to do it justice; there is a nice article from AWS and one from Martin Fowler that I've linked. The URL isn't visible here, but the slides are available, so you can look them up from there. In our case the reasoning goes a bit like this: event-driven architecture and asynchronous communication really help with circuit breaking between the platform and the downstream services, and with that we could put a stop to the propagation of load spikes.
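As described next, we use GCP Pub/Sub for this. For the pull-subscription case, a minimal sketch of the flow-control idea might look like the following; the project ID, subscription name, and limits are made up for the example, and this is not our production consumer.

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project") // hypothetical project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	sub := client.Subscription("station-events-sub") // hypothetical subscription

	// Flow control: cap how much unacknowledged work this consumer holds at
	// once. During a reconnection storm the queue absorbs the spike and the
	// consumer drains it at its own pace instead of being overrun.
	sub.ReceiveSettings.MaxOutstandingMessages = 500
	sub.ReceiveSettings.MaxOutstandingBytes = 50 * 1024 * 1024

	err = sub.Receive(ctx, func(ctx context.Context, msg *pubsub.Message) {
		// Process the event (e.g. "station connected") and ack it;
		// a nack would make Pub/Sub redeliver it later.
		log.Printf("event: %s", msg.Data)
		msg.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}
}
```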
In our case we use GCP Pub/Sub. For example, push subscriptions have a mechanism similar to TCP congestion control, so that the subscriber isn't overwhelmed if it's having a hard time processing the messages; it doesn't get driven into whatever bottleneck the subscriber might have, typically the database. And if you're using pull subscriptions, there is the flow control that Pub/Sub gives you. That's the circuit-breaking aspect that the event queues provide us.

Another benefit of event-driven architecture is that it helps decouple services and teams. The goal is to have more independent teams and a clearer split, better boundaries, between the services. Event-driven architecture isn't the only way to achieve that, but it helps, and it usually makes this decoupling easier to achieve. Also, asynchronous communication fits the underlying business processes well, so we can model them well; it was a good fit for us and it helped us address a lot of the problems. However, there are a bunch of use cases where event-driven architecture is a pretty bad fit. Even though event-driven architecture is pretty cool and gets you a lot of street cred, it's right up there with rewriting your system in Rust, that doesn't mean we apply it mindlessly everywhere. We really have to make a deliberate choice about the trade-offs we're accepting. One of those trade-offs, for example, is eventual consistency: not all systems can tolerate eventual consistency in their business flows. I've listed a little bit of our approach here, but I won't go through all of it. The main takeaway is that to be successful with a transition to event-driven, it's critical to lower the barrier of entry for the teams, especially teams coming from a synchronous way of building things, and to help not only on the conceptual level but also on the technical level, by providing good abstractions and good tooling to lower that barrier of entry as much as possible and make the transition smooth. We invested a lot of effort into that to make the transition successful, so that would be the takeaway there.

And with that we come to a summary. There is no panacea for addressing these challenges, and analysis of the load patterns and of the bottlenecks is crucial to inform your system design decisions. We went through the challenges in our case, and I'll mention the core ideas again. We made the WebSocket connection flow more lightweight by introducing an in-memory database, which we can scale independently and more easily, and which avoids the bottlenecks we had with the Postgres database. We reduced the disruption during deployments by introducing blue-green deployments, and we discussed how the deployment strategy can affect the reliability of your WebSocket applications. And we talked about circuit breaking and what you can do with event queues to stop the propagation of load spikes. That's about it. I've listed some links where you can find me online, or best of all, simply grab me for a chat after the talk. And now let's go into the question-and-answer session.

Hi. Hello.
You were talking about Pub/Sub as a way to decouple the downstream services, but you're using Google, so you basically have infinite scaling on your Pub/Sub. If you were on-prem, for example, wouldn't you just move the blow-up from the load spikes from the downstream services to your Pub/Sub system?

That's a good point. It's often like this with the cloud; there's a bit of a misconception that "oh, I shifted it to the cloud and now it's their problem". You need to be aware of the limits and the quotas and of what happens when you get close to those limits. It's a good question, and I don't think there is a straightforward answer, but I can say that in our case we did some back-of-the-envelope calculations and we are well within the limits that Pub/Sub has. When we get closer to them, we can revisit some decisions and think about it again. But it's a very good thing to consider when using this, and it would be a misconception to approach it as "it can scale infinitely and I don't have to care about it". And it's probably even more concrete when you are operating the event broker yourself and not relying on a cloud provider.

Thank you for the talk. When you talked about switching to blue-green deployments, did you think about somehow making it gradual: creating V2 all at once, but then killing the V1 pods one by one, so that not all stations reconnect at the same time? Because you do get this problem that all stations try to reconnect at the same time. So did you maybe consider something like a gradual blue-green deployment, where you kill off the connections more gradually?

That would be something to consider; we didn't do it. When we were considering Argo Rollouts and some other available solutions, we also considered implementing it ourselves, so that we don't add a dependency to the cluster. We did an evaluation of the options, and the total cost of ownership leaned towards just using Argo. I'm not aware that Argo Rollouts has something like that, where you can gradually kill off the old version's pods; usually it just starts killing them as soon as the timeout runs out.

I mean, you have long-lived WebSocket connections, so it's a bit of an unusual use case. When a blue-green deployment happens with a standard request-response app, it doesn't matter. But as you described, you have the problem of all stations reconnecting at the same time, and you kind of still get that with blue-green deployments.

Okay. You mentioned eventual consistency, and you shifted to using an in-memory database in front of Postgres to reduce the latency caused by the database connectivity. What do you think about, I mean, couldn't you have used a database that provides eventual consistency instead?

I'm not sure I understand the question correctly; could you repeat it or rephrase it a bit?

Yes. Instead of using the in-memory database in between, wouldn't it have been better to use a database that provides eventual consistency?

I would say the goal of introducing an in-memory database into the connection flow was not to achieve eventual consistency. It was rather to make the flow more manageable to scale and independent of the other flows, so that when there is high load, we don't put high pressure on the Postgres database, which a lot of the other business flows depend on.
So we would decouple it from the rest of the flows and make it more lightweight, and we could scale it independently by just scaling the in-memory database, Redis in our case, though it could be any in-memory database. Eventual consistency, I think, came up later when we introduced the event queues; I mentioned it as a trade-off that comes with asynchronous communication, but it wasn't a goal in itself that we were trying to achieve by introducing Redis. I hope that answers the question, or if you have a follow-up, let me know.

All right, it looks like there are no more questions. Thank you for joining.