Hello, ServiceMeshCon Europe 2021. Welcome to my talk about rapid experimentation, simplified with Linkerd. My name is Alex Jones. I'm a principal engineer at Civo. Civo is a cloud computing company focusing on K3s, Kubernetes, and really driving home developer experience as a first-class citizen. In my former life, I've worked at Microsoft, BSkyB, JPMorgan, and American Express, to name a few. And really, the financial services industry is what I'm going to focus on today, because they have a lot of problems that are compounded by the fact that delivering versions of application software is very slow, and on top of that, comparing the changes is very difficult. That's why the topic of experimentation is so pertinent right now: a lot of these businesses are going through a transformation where they're trying to give engineers the tools they need to perform low-cost experiments and determine whether a feature change is going to cause an application impact.

So the agenda for today: first, why is there a need for experimentation? Why do firms invest in tooling, and why are things like Linkerd becoming so exciting as very low-bar-to-entry ways of measuring those experiments and testing hypotheses? Second, the apparatus of that experimentation: the technical side of how this is implemented, how difficult it is to use, and what kinds of things you can do. Is that A/B testing? Is that chaos testing? Is that canary testing? And third, what is the implication of lowering the bar to entry on this form of experimentation?

Before we go any further, it's really important to set the scene in terms of why there's a need for experimentation. Think about this simple example: you have a v1.1 all the way to a v1.4. Let's say you're a product engineering team rolling down the tracks, building out these version changes. What happens along the way is that our SRE team starts telling us, hey, the latency of your application is increasing over time. This is a very coarse-grained approach to understanding the infrastructure footprint of our application. Whether that's compute, whether it's manifested as IOPS or some other signal, we are making a change in the environment that is going to cause additional outcomes over time in terms of how that microservice or application interacts with other systems. And so it makes a lot of common sense that we want to measure scientifically what the delta of change is, not just in code, but also in performance. I think about this in terms of reduced signals as well, and I'll come on to describe what I mean by that in a moment.

A secondary example is how that application or service interacts in a complex environment. If we're changing the versions of several microservices, how do we know scientifically what the change will be to a queuing mechanism, and how do we understand the knock-on implications? I think this also shows there's a real need not only to inject faults, but to understand how the environment performs when there's service degradation. There's an old joke that the disaster recovery plan is nothing like what it looks like when you actually have to perform it. That's because it's so high-cost in many organizations to actually run a DR exercise that many of these plans remain hypothetical and aren't really testable.
Another part of this, illustrated by the previous two examples, is that A/B testing should be easy, right? This idea of having some change and being able to test it should be easy. The problem with this FaaS-based example here is that between these two functions I'm essentially performing a failover: I have to test the new optimized database table against a new function and then fail back again. And this could even be a code change, an optimization in the code rather than the database. But the point is that you have to manually change the service, whether through an automated deployment config or some other human activity; it's not something you run simultaneously. Equally, this becomes even more compounded when you want to run, say, 20 to 30 FaaS changes with many small nuances between them. So A/B testing needs to be easy, and this is a really, really difficult problem to solve for an enterprise in a safe and scalable way.

But wait a minute. Many people will tell me, hey, in our environment we can deploy multiple versions, no problem whatsoever. We can have 1.1, 1.2, et cetera, on branches, PRs, yada yada. That's fine, but let's really break that down. You have your microservice alpha, v1 and v2. The delta of change is just the code, and we go through the typical routine of committing that code change. We deploy it through a pipeline and get a new ReplicaSet. Great. But how do we then see what changed between these? We have to do some sort of activity where we observe the prior state and then look at the new state, so we can look at a pattern of change and determine whether there's been a regression. And if there is a regression, we have to go back to the drawing board and figure out what it is. That could be latency, that could be saturation, some other penalty we have to pay. The point I'm trying to illustrate is that the cycle is fairly long, and it's extremely arduous to do across multiple versions, because that's just comparing v1 to v2. We should be able to compare v1 to v3 and v3 to v2 simultaneously.

So the challenge is probably quite clear by now: it's expensive, not only in monetary terms but in time, to promote changes to a new environment, especially within financial institutions. A multistage, multi-dependency chain of promotion is a really big overhead to bear to test what I would call a microversion, a small bet. Observability of these small bets has to be targeted. A lot of the labeling systems that you get out of the box in these organizations aren't dynamic enough to capture these microchanges. Whether that's a suffix on a version or a SHA on an image, we need a way to pin the difference between changes and then measure those over time. And with all that said, it's probably clear that this is complex, right? This is a complex and often impractical set of ideas to bring to an organization that doesn't know where to start. There might be an application team delivering 20 different microservices, and each of those microservices might have a bunch of branches and a bunch of changes in each branch. How do you determine which branch is introducing a regression in terms of infrastructure performance? So let's take a step back and think about distilling those requirements.
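To make that pinning idea concrete, here's a minimal sketch of what tagging a microversion might look like in Kubernetes metadata. The label keys, values, and image name are hypothetical, not from the talk; the point is simply that every small bet carries its version suffix and image SHA so metrics queries and dashboards can target it:

```yaml
# Hypothetical Deployment metadata pinning a "small bet" so its
# metrics can be sliced apart from every other variant of alpha.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alpha-v1-2-exp7
  labels:
    app: alpha
    version: "1.2"
    experiment: "exp7"       # suffix identifying this micro-bet
    git-sha: "3f9c2ab"       # commit/image SHA the variant was built from
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alpha
      experiment: "exp7"
  template:
    metadata:
      labels:
        app: alpha
        version: "1.2"
        experiment: "exp7"
    spec:
      containers:
      - name: alpha
        image: registry.example.com/alpha:1.2-exp7   # tag carries the variant
```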
Here's your typical infrastructure architecture: you've got an application with a microservice and a queue, and it might create something in a database. It stands to reason that we should be able to test, in real time, an alternative version of this architecture, and in this case that's calling the database directly. This is a fairly well-known and well-trodden path, but we find it difficult to do because it's hard to tell the API gateway to send data to both of these without a code change in the gateway. And again, that introduces more change and means there are more unknowns to measure. Equally, we should be able to understand what happens if this service starts faulting without actually having to codify a fault into the service. So, thinking back to our disaster recovery illustration, there needs to be a way to bring failures and chaos into the system and build more resilient systems. And lastly, observing the difference across generations is paramount to this succeeding. We can do all this stuff, but if we can't observe it in a way that isn't super coarse-grained, then it's pointless. How do we bring rate, error, and duration signals, plus utilization and saturation, all to the table and say that between version one and version three there's a massive regression? These are things we need to understand, and understand how to measure.

So that brings me to the apparatus of this experimentation. After looking at a lot of different solutions, what we settled on time and time again was Linkerd. The two key tenets of this are traffic splitting and observability, both of which are underpinned by super-easy-to-use DX that has saved us a ton of effort time and time again by just working out of the box. When I think about how these things work, it reminds me of the effort that's been put in at the SMI spec level by the CNCF SIGs, who care about the future of these implementations and how end users are going to work with them. And that's much appreciated, because when we look at the custom resource definition for how a traffic split should work, it's super easy to understand that in this example there's a 90% traffic balance to the v1 versus 10% to the v2. Equally, the v1alpha4 of the TrafficSplit spec is taking us in a direction where we can start to perform front-end application testing inside of the mesh. And that's exciting, because we can start to define headers that we care about; in this example, it's a user agent of Firefox. So there's a super exciting future ahead for enabling A/B testing within Linkerd and other SMI implementations.

Another big feature of Linkerd, alongside this idea of traffic splitting, is visualizing that data, and developer experience. I've mentioned two or three times already that developer experience is super important, because when you have 500 or 600 teams, amplified by the number of developers on those teams, all trying to work with the mesh, their level of experience is going to vary vastly. When you have super crisp dashboarding and visualizations of what's going on, it makes everyone's lives easier. And this is great because you can double-click into it, so if you're an engineer who wants to understand, hey, what's going on with the response codes, what's going on with the internal host headers, you can do that within the mesh, within Linkerd.
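As a sketch of what those CRDs look like (the service names here are hypothetical, and the exact apiVersions can vary between SMI releases and mesh versions), a weighted split and a v1alpha4 header-matched split might be written like this:

```yaml
# A 90/10 weighted split between two backends of the apex service.
apiVersion: split.smi-spec.io/v1alpha2
kind: TrafficSplit
metadata:
  name: webapp-split
spec:
  service: webapp          # apex service that clients address
  backends:
  - service: webapp-v1
    weight: 90
  - service: webapp-v2
    weight: 10
---
# v1alpha4 adds `matches`, so a split can apply only to requests
# matching an HTTPRouteGroup, e.g. a Firefox user agent.
apiVersion: specs.smi-spec.io/v1alpha4
kind: HTTPRouteGroup
metadata:
  name: firefox-users
spec:
  matches:
  - name: firefox
    headers:
      user-agent: ".*Firefox.*"
---
apiVersion: split.smi-spec.io/v1alpha4
kind: TrafficSplit
metadata:
  name: webapp-firefox-split
spec:
  service: webapp
  matches:
  - kind: HTTPRouteGroup
    name: firefox-users
  backends:
  - service: webapp-v1
    weight: 90
  - service: webapp-v2
    weight: 10
```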
Equally, for SREs, there's that deeply ingrained Prometheus and Grafana installation that lets you be a bit more scientific over time. It's one thing to deploy a service and say, yeah, okay, it's introduced some latency. It's another thing to then start pushing load into that service and looking at how it performs compared to its prior generations.

So let's look at a small demo now. I've got this Linkerd demo repository right here. I've got a client that calls a version, and that version is one, two, or three. What this client does is create a user, and the user just sits in memory. Now, the difference between one, two, and three is pretty small, but I'll show you. In version two, I've changed the Swagger spec, and this is to represent a code change that an engineer might make; in this case, it's adding a required food-preference field. My client is super dumb, and all it's doing is hitting the API with the default user fields. So what's going to happen is I'm going to get a 422, because the service is going to say, hey, I can't process this, I don't understand, where's my food field? Equally, in the openapi v3 service, I've introduced some latency: I just put time.sleep calls all over the code so we can emulate what would happen if a real regression were introduced into that service. And what brings this all together is the traffic split. The default behavior of the openapi client is to hit this v1, but what we're saying now is, hey, actually, I want to balance equally between the v2 and the v3. So let's go ahead and deploy that.

Cool. I want to visualize this, naturally, so let's run linkerd dashboard. If we go in here, we can see the default behavior as expected: the openapi client is hitting the v1 backend. We can see, though, that the traffic split has come online, and we have some prior data for the existing service. What will soon happen is that as the live data starts to come through from the new routes, we'll see these fields start to get populated. You can see one's come up right there. And what's really, really exciting and useful about this is that it just works, right? There's no additional configuration; it was just what you saw. I applied that CRD and away we go. Now what I'm interested to see is whether there's going to be more latency on this route, because as we saw, I've added in all sorts of sleeps, and user creation, as it gets round-robined between these, should start to get slower on this route. And hey presto, we can see almost 10 seconds at the P99 on this route coming through. In addition to this, I could ask, okay, what's the history of this v3 API? Again, clicking through to Grafana is great: I can look and see, oh, you know what, over the past couple of minutes this is something I know to be a behavioral change in this service, or this is something that's absolutely fine. I love the idea that you can combine this with existing dashboards, too. If I'm working in a compute-constrained environment and I'm deploying 100 different micro-bets, I can add a default dashboard that focuses purely on compute, so that when I have all these different services (and if I refresh, now I should see them), I can tell which one is going to be the most performant change. These three could represent three different hashing algorithms.
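The CRD applied in the demo presumably looks something like this minimal sketch, with the v1 apex service split evenly across the v2 and v3 backends; the service names are my guess at the repo's naming, not confirmed from the talk:

```yaml
# Hypothetical reconstruction of the demo's TrafficSplit: the client keeps
# calling the v1 apex service while half the traffic goes to each variant.
apiVersion: split.smi-spec.io/v1alpha2
kind: TrafficSplit
metadata:
  name: openapi-split
spec:
  service: openapi-v1      # apex service the client targets by default
  backends:
  - service: openapi-v2    # Swagger change: required food-preference field (422s)
    weight: 50
  - service: openapi-v3    # time.sleep regression: higher latency
    weight: 50
```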
It could be anything you want to try to test in a very low-bar-to-entry way. So when we come back to thinking about what this has actually done in terms of lowering the bar to test a hypothesis, well, we're essentially creating an experiment factory. We're creating the ability for developers, on their local machines and in their lower environments, wherever they might be, to put a load of different small changes out there and test which ones work best. And when you start cascading this from a single microservice to multiple microservices, it gets really exciting, because at that point you can start combining it with chaos testing. It's almost like an evolutionary Darwinism of microservices, where you're trying to see which one survives best. So in this example, let's say alpha v1 and v2 actually have different routes. What we can see is that the implication of those routes can be quite significant in terms of which is more resilient. We might find that, you know what, if we knock down alpha v2, it impacts the beta v1 service in a way we just didn't know was possible. And that is the power of traffic splitting. With the observability afforded to you by Linkerd, you can do A/B testing, chaos testing, and canary releases as well. We've found time and time again that developers find this so easy to use that they're starting to chain this stuff in ways we didn't even think possible, and that's affording a lot more resilience at the product engineering level. It means the candidate releases that actually go forward to production environments are innately more stable, because there's been real consideration of the infrastructure signals against which they're measured.

Now, obviously, there might be the argument of, hey, deploying to a lower environment isn't completely representative of a higher environment. Well, that's where the adventure gets really, really exciting, because you can start dipping into those higher environments, especially if you federate your mesh across clusters. So there is a strategy to scale this out to however wide your risk appetite is. Emboldening engineers, I think, really is the crux here. I've said throughout this talk that it's about performing experiments, but ultimately those experiments are about tiptoeing along the path that keeps our service stability the highest. Rather than having SREs come to the product engineering team after the fact and say, hey, you've got a degradation here, it's about building in that cultural change so that people care about making resilient services. The behavior I expect this will start to drive is SREs and product engineers working at the same level. They're starting to think, hey, let's bring in our QA friends, and we'll all build these traffic-splitting methodologies so we can test these microservices and really think about how we can break them. And that's the hat you need to wear: how can I break this microservice architecture? Because ultimately, the best genetic variation at the end of it will survive the longest. I feel like there's a super bright future for this sort of stuff, and I hope you have enjoyed this talk. There are so many questions I have yet to answer and so many things we have yet to talk about.
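One hedged sketch of what combining traffic splitting with chaos testing can look like: route a slice of a service's traffic to a backend that deliberately fails, then watch how downstream consumers cope in the dashboard. Linkerd's documentation describes fault injection along these lines; the service names below are hypothetical, not from the talk:

```yaml
# Hypothetical fault injection: 10% of alpha's traffic is sent to a
# backend that always returns errors, so we can observe how beta-v1
# and other downstream services degrade under partial failure.
apiVersion: split.smi-spec.io/v1alpha2
kind: TrafficSplit
metadata:
  name: alpha-fault-injection
spec:
  service: alpha
  backends:
  - service: alpha-v2
    weight: 90
  - service: alpha-error-injector   # e.g. an NGINX deployment returning HTTP 500
    weight: 10
```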
And if anybody wants to follow up with me offline, we can chat about all this stuff. But ultimately, watch this space, because canary releasing, A/B testing, and chaos testing are all completely possible within Linkerd and are all being done today in hundreds of companies with a lot of success. Thank you.