Hello, my name is Stephen Gray, and I'm here today to talk about real-time Kubernetes, and how Entain Australia achieved 10 times the throughput with Linkerd. Entain is a FTSE 100 company and one of the world's largest sports betting and gaming groups, operating in the online and retail sectors. We operate all across the world, with brands and teams in dozens of countries, and Australia is one of them. As the head of trading solutions here in Australia, I lead the team of engineers that build and operate our trading platform, powering our regional business. We've transitioned over the last few years from a non-agile release cycle to a lean agile management practice, with ownership and self-support of our solutions. We use chaos engineering, automated tests, continuous deployments, and even spot nodes to optimize our cost base.

Our team is responsible for managing the torrent of data coming into the business and reacting to it in real time. Data providers from all over the world feed us information continuously as sports and races happen. These feeds come into the trading platform that we've built here in Australia. Our trading team interacts with that platform as their main line-of-business tool. It then feeds into our transactional platform, and our customers interact with that platform. So specifically, what we do is this piece here in the middle.

Our software stack is based on very contemporary tools. We use Go as our main programming language. For messaging and distribution of data, we use Apache Kafka. MongoDB is our main data store, but we do use other technologies such as Python, R, RabbitMQ, and even .NET Core. Our entire platform is based on cloud-native technologies. We host on Kubernetes, we use Linkerd as our service mesh, all of our network communications use gRPC, we have tracing instrumentation via the Jaeger and OpenTracing projects, and we do our deployments with Helm.
In terms of key metrics about our team: we operate five clusters. There are 250-plus cloud servers in that environment. There are zero relational databases. We have hundreds of deployments of our software, managed by a team of 13 engineers, of which only one is a dedicated DevOps engineer.

It wasn't always this way. Our microservices journey started off with just a single application. It powered the entire business: all of our customer-facing applications, but also the entire back-of-house trading operation. It was so large and unassailable that we called it the monolith, and it was 1.7 million lines of PHP. Almost none of that code was actually testable.

Then we began our microservices journey. In 2018, we had that single platform, our monolith, and it ran on about 100 commodity servers in two data centers that we operated in Sydney. Later that year, we started our journey to improve our data-handling pipeline and move away from this monolith. It became three services, focused on taking our sports data away from our legacy platform. That enabled us to break away and start iterating on those components in an agile fashion, away from the release train of the main platform. At this point in our journey, most of the work was actually going into supporting the illusion and keeping up compatibility with the old platform. By mid-2019, those three services had become more than 20, and they were growing as we started to find our feet with microservices design. Incredibly, we were still keeping up the illusion of compatibility with the old platform. Throw in a merger, and suddenly we found we had 40 services, but we were also now driving the same content to multiple customer-facing solutions, our legacy and new platforms. That was a flex that was only possible because we happened to have undertaken this work.
In terms of community of practice as well, the merger was really good for us because, like ours, the new customer-facing platform was written entirely in Go, and eventually the monolith went away. Today, though, if you look just at our system, we're at around 347 services, just for our team. The wider Australian technology team has a similar footprint again for the customer-facing side of the show.

In terms of our service mesh design, it's actually quite chaotic. Here's a preview from Buoyant Cloud showing the topology map of our services, and if we zoom in on just a small section of it, you can see that these tiny little blocks are the individual services that make up our fabric. We process around 10 billion external updates a month, and we need to do it as fast as possible. If our system runs slowly, if our prices are behind the gameplay for sporting events, that's bad for us. And if our system is down, effectively, so is our business. So as a team, we're in a continuous and rolling fight against latency in all forms. If you look at this map, courtesy of Submarine Cable Map, you can see that the path from a stadium in Italy to us in Australia is not quite a straight line. And this is a best case: quite often we find that data routes via the United States, even when coming from Europe. So it's very much a challenge to keep things sub-second. Given the nature of our work, it also means we have a zero-downtime requirement. No planned outages are allowed. And we have to have high-performance updating of prices: sub-second, every second, 24/7.

As you grow a microservices stack like that, a few interesting areas and questions come up. For us, the three major ones were: how do our services find each other in this topology? How do we manage the governance of what service does what, and when a service needs to be deprecated or removed?
And also, how do we observe and measure our services to make sure we understand what's working and what's not?

Over the years, we used a variety of approaches to service discovery and connectivity between our services, but as we moved to Kubernetes, we transitioned to using Kubernetes logical services. Service A would talk to service B by referencing its service name and routing to the individual pods of service B. In this example, pod A is balancing across pods X and Y from service B. Now, we use gRPC, as I mentioned, for our backbone protocols, and this means that unlike HTTP/1.1, we have fairly long-lived connections. Once gRPC has established a pool of connections, these will typically keep being reused for a long time. What we found was that the individual pods of a given service would persistently talk to particular pods of other services, because the connections were being reused rather than balanced across multiple instances. Now, with a large enough number of pods on both services, you roughly end up with an even distribution, so it doesn't sound too bad at this point.

Where things got a bit more exciting was as we rolled out code. We'd gradually rotate through the instances of service B, shutting each one down and starting a new one. What tended to happen was that the very first pod that came up in the new deployment would start receiving almost all of the connections. So as we rotated through the pods, all the connections moved to pod X, and that meant service A was now heavily biased towards one instance. We'd see high load for a period of time after every release. This isn't insurmountable: at this point we'd look at the graph and go, ah, the traffic's not balanced, and we would restart service A.
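What we saw on each release can be sketched as a toy simulation. This is a deliberately simplified model, not real gRPC or kube-proxy code, and the pod names and counts are made up: it assumes old pods drain (stop accepting new connections) while new pods become ready one at a time, so every client that loses its connection re-dials across whichever new pods are ready at that instant, and the earliest new pod accumulates a disproportionate share.

```go
package main

import "fmt"

// rollingRestart models long-lived connections during a rolling deploy:
// at each step a new pod becomes ready, the matching old pod terminates,
// and the terminated pod's connections re-dial round-robin across the
// NEW pods that are ready so far (old pods are draining).
func rollingRestart(oldPods, connsPerPod int) map[string]int {
	conns := map[string]int{} // new pod name -> connection count
	var ready []string        // new pods ready so far, oldest first

	for step := 1; step <= oldPods; step++ {
		ready = append(ready, fmt.Sprintf("new-%d", step))
		for i := 0; i < connsPerPod; i++ {
			// Each displaced connection lands on one of the ready new pods.
			conns[ready[i%len(ready)]]++
		}
	}
	return conns
}

func main() {
	// 4 old pods, 25 connections each: the first new pod ends up
	// holding over half of the 100 connections.
	fmt.Println(rollingRestart(4, 25))
}
```

Under these assumptions, `new-1` finishes the rollout holding 54 of the 100 connections while `new-4` holds only 6, which is the per-release load skew described above.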
Service A would restart, and each instance of service A would balance cleanly across all of the instances of service B. The trouble then became that the things that talk to service A themselves ended up biased towards particular pods. So we'd roll those, and so on, and so on. Eventually it became impractical. So, c'est la vie, we ended up in a situation where we were just letting things settle on their own. We'd do a deployment, server load would suddenly reach for the sky, and after a while the connections would eventually balance out.

One day, though, we had to reload a large amount of data. We had to process roughly two weeks' worth of data in the space of a couple of hours. This went poorly. When we started bulk processing, we suddenly found there was a lot of pressure in our system, the load was massively imbalanced across our topology, and the latency started spiking. This is not a good thing. We tried everything: we rolled services, we restarted deployments, we scaled up, we scaled down. But no matter what we did, we frequently ended up with a chain of hot paths through the system, and this would cause our total throughput to vary between near zero and nowhere near fast enough. No amount of restarts would help us.

As we kept looking at this issue, the work became more and more critical. Hours became days. We started working late. We started to look at alternatives, including writing our own custom gRPC connection-pooling code: something that would forcibly balance the traffic between the pods of our services, which is something we'd done before. We tried loading in the data during quieter times, and for a week we basically lived on caffeinated drinks and prayer. And then one of our team asked: have we considered a service mesh? We had not. So, at around midnight on day four or five, we started looking at the various options for a service mesh.
Our team being quite small, we had to choose something that ticked a couple of different boxes for us. Given the number of services we had, it had to be a solution that required no code or configuration changes, because making code changes to 300-plus services, and then rolling them back if it doesn't go well, is something that's very hard to do. It had to be simple to install and easy to operate; we didn't want to have to hire an extra person just to look after it. But it also had to balance the traffic better than we could do ourselves.

After a furious five to ten minutes of Googling, we found an article on the Kubernetes site where William Morgan from Buoyant was talking about gRPC load balancing on Kubernetes. He was talking about our specific problem, which was awesome. After a bit of discussion within the team, some prototype installs of the various options we had, and lots of cursing at the configuration we had to do, we kept coming back to that article over and over again. We'd found our front-runner, and that was Linkerd.

Before we went into a large-scale enterprise deployment of it, we took a lot of steps. We joined the Slack channel and talked to a few people, and nobody there seemed to think this wouldn't work. But obviously we were going from 0 to 60 miles an hour very quickly on this. So after a bit of agonizing about the risk of doing this in a production system, we ultimately decided that because we could uninstall and reinstall very easily, we'd give it a shot. In terms of the deployment process, we use Spinnaker for all our deployments. Installing Linkerd into the cluster took roughly 15 minutes, most of which was spent reading the documentation, because the installation process was essentially two commands.
The next challenge was that for our 340-plus services, we had to redeploy them with some extra annotations to opt them into the service mesh. That took us a couple of hours, because we had to go through one by one, restarting things and doing the deployments, but it was far simpler than making code-level changes. Once we got everything installed and running and we saw our pods start turning up in the mesh, we just let it sit for a few hours, because it had been a long few days, and also just to see if anything happened under baseline load. We also caught up on some sleep.

After a few hours of letting things sit without seeing any problems, we decided to kick the tires and start loading the data. From this graph, see if you can spot when we started loading. It's pretty impressive in terms of a change. Our baseline was dramatically improved: instead of saturating the 20-gigabit links on just a few of our servers, we were suddenly processing at network line rate on all of our servers. The CPU and load evened out across the cluster, and we just kept on trucking. In fact, the first thing we said after we saw the metrics was that it was faster than it had ever been, enough so that we were worried it might not be correct.

So our Hail Mary attempt to fix the performance issue worked, first time, but we didn't really understand at the time what had happened or how we'd gotten these gains. It turns out that because Linkerd was performing the connection pooling and rebalancing for us, our applications were now balanced at the request level for each individual call, while at the application level still believing they had a pool of reused connections. This evened out the workload, for all intents and purposes, perfectly. We could suddenly restart services and terminate them, and go back to our chaos engineering practices, all without interrupting the flow of data in real time. And the way it works is by injecting a proxy into each of your pods.
These proxies communicate with each other directly, bypassing the service discovery and load balancing mechanisms built into Kubernetes. The Linkerd control plane watches for your pods being started and stopped using the Kubernetes API, and maintains a rolling database of all the endpoints available for a service. Each service, when talking to other services, uses the data from the control plane to intelligently bias and weight towards particular targets. Our application code was entirely unaware that anything had changed.

Like many people, we operate our application in multiple zones within a geographic region, and one of the surprising costs for many companies is that moving data between the individual zones inside a region is quite expensive. When we installed Linkerd, one of the things we found within the first couple of days was that our bills were coming down. It turns out that because Linkerd biases by latency, traffic tended to stay within the same availability zone rather than spreading out across the geographic area. This alone, in retrospect, would have been worth the exercise for us.

Aside from the cost savings, we had a few other small wins. We had service metrics out of the box, so we could get rid of our own metrics collection for the paths from service A to B to C. There was instrumentation and tracing support, which was the start of our Jaeger journey, and those traces could span across our network proxies and track the flow between services essentially for free. Additionally, we had mutual TLS support for all communications between pods, enabling us to encrypt everything going on inside the cluster, whereas previously, I think, we were more or less biased towards doing things at the ingress level. The reduction in peak CPU load also let us do things like re-enabling network compression.
So our payloads could be further compressed, further reducing bandwidth costs.

In terms of our journey onward, some of the major forward-looking changes still to come for our infrastructure include deploying Chaos Mesh more fully; we've got some things working with it now, but we want to take that to the next level. We'll also continue on our journey with Linkerd: we're going to switch from having three separate cross-zone clusters to isolated individual clusters per geographic area, and we're also looking at the policy changes in version 2.11 to secure traffic further within our cluster.

In summary, though: we have a lot of services, and we need to do things very fast. When you're using gRPC or other long-lived RPC protocols within Kubernetes, it's very easy to end up, out of the box, with hot paths in your platform and inefficient load balancing. We were able to deploy a service mesh to production, to hundreds of services, with zero prior experience, in the space of a few hours, and we reaped immediate operational and cost benefits. More information is documented over on the CNCF blog, where one of my colleagues and I have written an article about this issue. Additionally, I'm a Linkerd ambassador, and I'm often available on Slack to discuss and help people on their journeys to the service mesh. Thank you.