So, welcome. I'm Matt Turner, and I'm going to talk about dynamically testing individual microservice releases in production. Which is a bit of a mouthful, but I wanted to get across what we're actually going to be talking about: testing, on an ongoing basis, new releases of an individual service as part of a bigger, more complicated set of microservices. So let's dive into what that means.

There's a lot to talk about, and a lot of these topics I'm only going to touch on briefly. There's a bit of background; the meat of this is really in the Istio config, what we can do with it, and the automation I've started to build for it, but I will touch on the background as well. I can't see my own slides from here, so I'll have to glance over occasionally to remember what I promised I was going to say, so I don't leave anything out.

A little bit about me: I'm a software engineer at Tetrate. Tetrate was founded to solve the problem of using service meshes at scale. Our management plane provides a layer on top of one or more Istio meshes: it'll install them for you and upgrade them for you, and it uses the Tetrate Istio Distribution for that, which is a build of upstream Istio. We haven't forked it, but it is fully FIPS compliant. You can then use our UI, or you can give it simple high-level config and we'll render that down to all the complicated Istio config you need for secure cross-cluster communication. And if you plug in your identity provider, we'll also let you set up inheritable, hierarchical permissions across all of that, so that people can do mesh ops in controlled ways.

A little bit about you, maybe: who's never used Istio or another service mesh? Okay, that's probably half of you, cool. Who's sort of a beginner who can make it do something? That's probably the other half. Anybody consider themselves an expert, like if I give you a problem you can write the config to fix it? Okay, a few people. That's
cool. I guess I got this pitched about right, then.

I'm actually going to approach this from the perspective of the problem we're trying to solve, and then introduce a service mesh—Istio, in my demo—as the way to solve it. But this is not an Istio talk, this isn't a deep dive, so hopefully it'll carry people through who haven't seen it before.

So, briefly: microservices. What are they? What do they look like in production? They might look like this. This is a small part of Netflix's service topology, apparently. They wouldn't give a higher-res image—maybe they're a little embarrassed for people to be able to read what's actually going on. Of course, you can redraw it to make it simpler, easier to read. This is also Netflix; apparently they've maybe got a little more connectivity than they should have. But when we do a talk, when we do a demo, we're probably looking at something like this, right? A simpler system. This might be part of your system in isolation; this might be your whole system if you're a new startup; or this might be—per the DOMA paper from Uber, which is a really good way of talking about breaking things up into little isolated sections—one domain. I think there's some animation on this slide that I'd forgotten about; I'll probably come back to this point.

So, there's a slightly thicker arrow there. When we've got a given operation—say we've got a user, we've got some web APIs that we're calling, some external services, we've got a database for persistence—any given operation that a user performs probably isn't going to hit all the services, right?
A user's request, a distributed transaction, might even end up going through a linear chain of services like this. More than likely some of these services are going to call multiple others, but for any particular path we can look at it like this. And if it's linear, then it forms a chain, and we can reason about things as a chain. This isn't necessary for what I'm going to talk about—what I'm going to talk about applies to the big mesh we saw before—but this is the mental model I'll be using, this is the example, because we can actually reason about it.

So imagine one path: user makes a request, service calls service calls service. And to be honest, it often is true. Telcos are very big on this: they might have, at a site by a radio mast, a firewall and a NAT box and a media compressor and all kinds of other stuff, and they will define chains out of them. Depending on who you are, what service plan you've bought, whether you're roaming or native to their network, they're going to send you down different paths—they might firewall some people, they might give some people more media services than others. So linear chains like this are more common than you'd think, and as I say, it's a mental model we can use for more complicated systems.

This is what the animation's here for: if you think about your services a little bit, they probably fall into a few different categories, right? We probably have the blue things, the back ends. They do the business logic, the heavy lifting. And then there are these web APIs we call out to—maybe one returns XML because it's from the 90s, maybe one is accessed over an IPsec tunnel. Hands up if you work in finance. We can hide those complexities, that nastiness, by making services that shim them, right?
So internally, the blue services can all talk gRPC, they can all use our internal auth mechanisms and rate-limiting mechanisms, and they'll use those same mechanisms to talk to the shim services, which do no logic but take care of the transport. Equally, we can shim databases, so that whatever the database's wire protocol is, we can talk a unified—for example, gRPC, authenticated—protocol to it. And then maybe we've got a couple of front ends on that, right? This is just the backend-for-frontend pattern: when the user calls in, they're going to get an HTML rendering, or a REST API, or a GraphQL API. Again, just something to bear in mind, because when we—okay, the animation's getting annoying now—when we trace a request through a system like this, the chain that we get has a front end, a back end (middle end, whatever you want to call it), and then maybe a database shim. So that's just something to bear in mind for a bit later.

So, continuous deployment, continuous release, which is the problem we're actually trying to tackle. I think CI/CD has been a hot topic for maybe two decades now. I'm going to look at the continuous deployment part of that and what it actually means. Say we've got a string of microservices, and there's a putative new version, a candidate, of one of them. I'm going to call these things v1 and v2—obviously they might be 1.0.35 and 1.0.36 or whatever, and you might have multiple candidate versions at a time, multiple people working on multiple branches; that all works, but v1 and v2 keeps it simple. So I've got this candidate version. It's in red because it might not work yet, but we want to start deploying it, we want to test it. So how do we test it?
Well, the Agile testing pyramid says you should do something like this. This isn't a bad model—it's not perfect, but it's definitely not bad. I have to add, I think before even the unit tests you've got a type system, and a borrow checker if you're using a good language. But the real point—I'm not just shilling for Rust—is that the bottom parts happen in isolation, right? They'll happen in your CI system, anywhere you can run a Unix process. The top parts are testing in context, if you like: they're testing with other services, and they actually have to happen in an environment. If we're going to spin up a chain of services and do an end-to-end test, that's got to run in Kubernetes, right? That can't really be a little test harness.

So most people do the integration tests and above—the end-to-end tests—like this. You have the build environment that's doing the unit tests and component tests; then you'll have a test environment, maybe, that does the integration tests, and a staging environment where we do the system tests. I couldn't find a real definition of this; there are a lot of copies of this picture on the internet, but nobody seemed to want to put a stake in the ground and give a definition for each of these levels. To me, system tests and manual tests are both end-to-end tests: system tests are automated, I guess, and manual tests are done by a human. And then we have prod, where we actually release, and where the service starts to get user traffic, because we've tested it.

So how do we run these integration tests? Well, that's easy enough. You can do this anywhere—anywhere you can deploy software. But, importantly, not CI.
This isn't a sort of unit test. In order to do an integration test of a black-box service like this, it needs to be subject to real runtime resource constraints, real runtime security constraints; it needs representative config files and environment variables. But we can deploy it to an environment, and the little robots are like test scripts that call in and call out, testing it like a black box.

But how do we do the end-to-end testing—the manual testing, or automated end-to-end testing? We need to exercise this service in context, in the context of this whole chain, or a more complicated graph. How do we do that? Well, we can have a staging environment where all the new versions are deployed, right? But this isn't representative, because these aren't the versions. If we're testing this one, we want to know how it behaves when it gets deployed: it's not going to see the new version of this and the candidate version of that, it's going to be sitting in between the two production versions, right? Assuming service 3 is the one that gets through test and gets deployed first, this won't detect a breaking API change. If this relies on a new API from here, and that isn't backwards compatible, this kind of testing isn't going to detect it—and that's obviously a big cause of breakages with microservices.

To get around that, we could have ephemeral environments. We could spin up a new environment for each service, even for each PR, each branch of each service. But these are hard to build, the automation's hard to build, they're expensive to run, and they're still not prod. They're still not representative. Anybody who's ever built one of these knows: they're always a look-alike of prod, but they're never really quite the same.

So, testing in production. If we want it to be representative, why not actually test in prod?
The issue with that is that our software is under test. Charity Majors has this "test in prod" thing, which basically says you're never going to catch all errors in testing—just release it, and deal with it when the users find things you never thought of. That's a little further along than this, I guess. But while our software is under test, we're not ready to release it. We don't want users to be exposed to its results, we don't want it to get user traffic, because the flip side is we don't want users to get results from it—they might be nonsense, right? It might still be broken.

So we get a win if we can separate the deploy stage from the release stage. If we can deploy it—run the new version in production, subject to all the quirks of the production environment—but not release it, where release means it gets user traffic, then there's no risk to the business. So we can separate deploy and release. Do we need separate test and staging and prod environments? I would say not. They can all be one thing; we can do all of these things in one environment.

So what does that look like? Well, we have the technology. What I really want to do is just test this v2. The user traffic is coming in and getting v1, v1, v1, because we know they're all stable. I want to be able to put v2 under test and get some test traffic to go v1, v1, up to v2, and then down to v1 again. This is all in prod: this is the prod database that I can read realistic data from, prod constraints, but it's still under test, so it's only getting traffic from the test bot. But the test bot—or the user, the developer—is at the front of that chain, so they can't just kubectl port-forward to write to here, right?
You can do integration tests that way, but you can't do an end-to-end system test like that. So the test agent needs to be able to opt in to test versions, like v2, at an arbitrary depth down that chain. How might we do it? Well, if we add a sidecar proxy to every service, and then we add a control plane to configure those proxies—well, we've got a service mesh, right? And with a service mesh we can take sophisticated control of all of the traffic in the cluster. We can do advanced routing, so we can deploy all of the v2s—not like in a staging environment: we deploy them into prod, but we don't let them get any user traffic, because the sidecars are doing advanced routing for us. We can then change those routing rules and put a little blip in the chain, and send things up to v2 and back. And this is all done by the service mesh, all in Kubernetes, configured by YAML. We're not fiddling around with iptables or VM network shenanigans, we're not doing layer-3 nonsense. This is all nice and Kubernetes-native.

So, has this been done before? Well, yes, actually, it has. (This slide did not load—"try reloading"—okay, I can show it anyway. Oh, it's missing the QR code, that's what it's missing. I don't know why; I put a QR code in to cite this, and I'm not getting a preview of it either.) So this was kind of inspired by a talk I saw called "Breaking Up Lyft's Deployment Monolith", given by a Lyft engineer, Jake Hoffman, at KubeCon London earlier this year—essentially doing what I showed, with the proxies. But Lyft has a head start: Lyft are the people that made Envoy, right?
And Envoy is the proxy that's used by Istio and a bunch of other service meshes. But they don't use Istio, they don't use any of the available service meshes, because they were first: they had their own custom thing, their own custom control plane, built on Envoy. So they did this, they managed to make it work, but it can't be reproduced by anybody else—there was a bunch of custom code, they had to fork Envoy and inject custom C++ into it. I thought it was a great idea when I saw that talk, and they said, hey, this is the idea, people should be doing this. But it wasn't reproducible. Then I thought: I'm pretty sure I can configure Istio to do that, and then everybody can do it.

So, can we do it with Istio? Well, yes, we can, because Istio gives us those proxies as well. Istio gives us a control plane; it gives us native, Kubernetes-based configuration. So we can do it with Istio, and I'm going to talk you through how.

Just a quick recap of the Istio configuration types—the two that we're going to need—or maybe an introduction for people who've not used Istio before. These are CRDs in Kubernetes. We have the Istio VirtualService, which basically answers: given a request for a named service, where do I send it? So: somebody wants to send a request to service foo.
In plain Kubernetes, you try to send a request to foo and you end up at the Service—capital-S Service—called foo. With Istio, the VirtualService slots in before that: okay, you wanted foo, you used the host foo in your HTTP request, but where is it actually going? I can select different real services if I want to; I can identify subsets of them and say, yes, that one, but we're only going to part of it. And I can make those routing decisions based on headers, or all kinds of other attributes of the request. This is what introduces the layer-7 routing that Istio can do and Kubernetes can't.

Then the DestinationRule type says: when I've chosen where I'm going to route to—say I'm going to service 2—how do I talk to that thing? How do I load-balance across all the pods in service 2? Do I use TLS when I talk to them? And, importantly, should I only talk to a subset of them—should I pick out just a few of the pods in that deployment and only talk to those?

So here we've blown up the service chain: this is service 2, this is service 4, and then for service 3, the naive Kubernetes way is that I make a Deployment, and I put a Service in front of it, and it selects, you know, app=foo. In this case I've also labeled the Deployment version: v1. If I deploy a foo-beta alongside foo-stable, I label it v2, but Kubernetes doesn't know any better: it's just going to send traffic to both—that's Kubernetes 101, sorry to teach you to suck eggs—in proportion to the number of pods in each one. And these could be ReplicaSets rather than Deployments, right?
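As a sketch of that naive Kubernetes setup (all names here are illustrative, not from the demo): a single Service selecting only on the app label happily matches pods from both Deployments.

```yaml
# Plain Kubernetes: the Service selects only on app=foo, so it
# load-balances across BOTH Deployments, roughly in proportion
# to their pod counts.
apiVersion: v1
kind: Service
metadata:
  name: foo
spec:
  selector:
    app: foo            # matches v1 and v2 pods alike
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: foo-stable
spec:
  replicas: 3
  selector:
    matchLabels: {app: foo, version: v1}
  template:
    metadata:
      labels: {app: foo, version: v1}
    spec:
      containers:
      - name: foo
        image: example/foo:v1
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: foo-beta
spec:
  replicas: 1
  selector:
    matchLabels: {app: foo, version: v2}
  template:
    metadata:
      labels: {app: foo, version: v2}
    spec:
      containers:
      - name: foo
        image: example/foo:v2
```

With 3 stable pods and 1 beta pod, roughly a quarter of requests would hit the beta—which is exactly what we don't want for an unreleased version.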
And that's how rolling update in Kubernetes works. So I can come along with the Istio CRDs instead, and I can slot that VirtualService in front. I can make these two DestinationRules and say: there's a subset of this foo thing that's v1, and that's this Deployment; and there's a subset of this foo thing that's v2, and that's this Deployment. I can then slot the VirtualService for foo in front, and, as we'll see, it'll say: under some circumstances I want to go to the v1 part, and under some circumstances I want to go to v2. And I can tell my VirtualService: your default mode is to send all the traffic to v1. So all of this is going through v1, and I can deploy v2—I can deploy it into prod, but I haven't released it, because it won't get any user traffic. It won't get any traffic at all at the moment, because the VirtualService says: hey, I know how to tell v1 and v2 apart; send everything to v1.

But I can add some config to that VirtualService to say: oh, well, if the request comes in with a header—say x-override—and it says "x-override: I want foo v2, please", then it can be sent off to v2 instead. So arbitrarily anywhere in the service graph, arbitrarily anywhere down that chain, if you've got a header that says "hey, I'd like to override foo to v2", we can send the request through v2 and then back into the chain as normal. This is how we do this sort of testing in prod, this override testing; this is how we recreate what Lyft did, and this is how we configure it with Istio.

So, what do those resources look like, if you've seen enough Istio to follow this? We've got a DestinationRule for foo. We say that this is talking about the host foo, and this is basically the identity function—this is Istio 101.
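Roughly what that pair of resources looks like—a sketch with illustrative names, and an assumed header-value encoding of service=version (the slides aren't reproduced here):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: foo
spec:
  host: foo
  subsets:
  - name: v1
    labels:
      version: v1     # the label on the stable Deployment's pods
  - name: v2
    labels:
      version: v2     # the label on the candidate's pods
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: foo-overrides
spec:
  hosts:
  - foo
  http:
  - match:
    - headers:
        x-override:
          exact: "foo=v2"   # simplified; the exact-match form has a caveat
    route:
    - destination:
        host: foo
        subset: v2
  - route:                  # no match section: the catch-all default
    - destination:
        host: foo
        subset: v1
```

Anyone not setting the header falls through to the last, match-less rule and lands on v1.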
We're saying there are two subsets, in this case version one—which you identify by a label on it that says version: v1—and version two, identified by a label that says version: v2. We've then got the VirtualService, which slots in front and does a little bit more routing. I'm going to call this one foo-overrides. It's looking—and you can actually override more than one service—for any request going to foo, and then it's got this HTTP block, and again it's pretty simple stuff: match the header. So match x-override: if it says foo v1, go to foo subset version one; if it says foo v2, go to foo subset version two. And then our catch-all stands at the end. These routing rules are like nginx config: they're matched and applied in order. So we have a default route at the end—it doesn't have a match section—which catches anything that didn't match either of the previous two. People who aren't setting these override headers just go to v1. Normal traffic flows to v1, but if you opt in with the header, you can go off to v2—or, of course, any other version; there can be as many as you want.

There's one little caveat with this: this isn't quite what they look like. You actually need to match like this—you actually need to use a regex to match the middle of this x-override header, so it gets a little bit messy. Just to go through the practical details, in case anybody goes home and tries to re-implement this. The reason is that we might have more than one override, right?
I might want to go to foo version 2, and then bar version 3, and something-else version 7. So we might have multiple instances of this x-override header, because each value is a service plus a version. If you use curl to make this request with multiple x-override headers, curl puts them on the wire as multiple keys: you'll get "x-override: foo v1" and "x-override: bar v2" as separate headers. That used to not be allowed by the spec—the spec used to say you have to condense them: one key, several values, with a comma between them. The new spec—I spent a long time reading RFCs—now says you're okay to send them separately, separate keys with values. So what we're doing is fine; curl does it that way, and Go forwards them like that (I'm doing the demo with a simple Go thing). But what Envoy presents you with, when you're trying to match these things, is the single combined form again—not against the spec, but a little bit old-school. So you get presented x-override precisely once, no matter how many times you specified it, and you get the values collapsed with commas. So we have to match a substring, because x-override might have been presented more than once.

Envoy—if you're configuring Envoy manually, which you should never do—gives you several ways of matching headers: an exact string, a prefix, a suffix, a regex, or the contains matcher for a substring. Istio only exposes part of Envoy's API: it gives you exact, prefix, and regex. It doesn't give you contains, and it doesn't give you suffix. So regex it is. The regexes in Envoy are Google's RE2 syntax, which took me a while, because it's a little different from other things. And you have to match the whole string.
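Putting those constraints together, the match ends up as an RE2 regex over the whole, comma-joined header value. A sketch of what such a rule might look like—assuming values encoded as service=version; this is my approximation, not the exact regex from the slide:

```yaml
# Whole-string RE2 match: "foo=v2" may be the only value, or be
# preceded and/or followed by other comma-joined override values.
# (If clients join with ", " rather than ",", allow optional
# whitespace around the commas too.)
- match:
  - headers:
      x-override:
        regex: "(.*,)?foo=v2(,.*)?"
  route:
  - destination:
      host: foo
      subset: v2
```

The `(.*,)?` and `(,.*)?` groups are what let the value sit anywhere in the collapsed list while the regex still matches the whole string, as Envoy requires.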
So we end up with—I think this is the tightest you can get, I don't think you can be any tighter than that—basically saying: you might be at the start of the string, you might be at the end, you might be both, because it might be the only value for this header; but there might be something before it and then a comma, and there might be a comma and something after it. Anyway, just so you understand what on earth that horrible regex was for. Honestly, this was the longest part: I left that conference thinking "I can implement this in Istio, this is fine", and this was honestly the longest part.

So the YAML is actually pretty simple—it's almost basic Istio usage. But you're going to need one of each of those resources, one DestinationRule and one VirtualService, for every service, every workload you've got. You're going to need a match for every version, and it's going to need to be updated every time a new version is deployed. So that's a pretty big combinatorial explosion, and it's ongoing work as well: if you're in an automated CI/CD environment where multiple versions are going out all the time, you're going to be updating those things quite a lot. It sounds like a great target for automation, and that's exactly what I did. It was writing this—I went home thinking "well, I can do this", and I wrote it, and that's what made me think I should give this talk, to tell folks about it.

I've only got five minutes left, so I'm going to try to give you a quick demo. If anybody saw my tweet earlier: the latest version doesn't work, but I was able to roll back, and the previous version does. Anyway, it's good enough, hopefully. (That's got a bit smaller than I thought. What does that say? No... no, definitely not. Okay, there we go—I'm trying to make my terminal bigger. I can't see that screen.)
So, by the power of tmux, I have the same tmux session attached here. I've got a few scripts just to make this a little easier. I've got a cluster running, I've got Istio installed—oh no, I haven't. Oh no. How far did I get? I didn't get very far at all; I was resetting this thing. Okay, bear with us while—oh, that's how far I got. Okay. I've uploaded the images, so we shouldn't be at the mercy of conference wifi; they should all load quite quickly. (Oh, that's really—you're not seeing what you're meant to see. You're not seeing what I'm seeing. Istio, I promise.)

And then I can deploy a chain of services. Okay... this is why you screen-record demos and sit at the back with a beer while the demo plays, but I didn't do that. So we're just going to deploy five services, two copies of each: service 1 to service 5, and service-1-beta to service-5-beta. As I say, the images are loaded, and the services are just a little Go thing I wrote that takes the request and forwards it, including all the headers, and they log what they do. This shouldn't take too long.

What else can I talk about while this is happening? So, yeah: currently the service needs to forward that header, and that's one of the future pieces of work. There are obviously security issues with that, right? You're definitely going to need to filter that header out at ingress, so a user can't set it. And you will have trust issues inside the mesh for that header, because anything that's compromised can start setting it. Whether it's a massive security risk to be redirected to a beta version, I don't know, but it's definitely something you don't want to turn into the Wild West. So you need to be able to trust that header. We could think of maybe some way of signing it, but actually what Lyft did was embed it.
They didn't use a header: they put this override information into the JWT. They pass a JWT around for service-to-service auth anyway, and they were able to stuff the override into the JWT. That's something I honestly didn't have enough time to add to this operator yet, but it's definitely a way to do it. It also opens the possibility for some more interesting stuff, like conditional overrides. One of the things Tetrate's working on in its product is conditional auth: service A can talk to service B only if service Z was earlier in the chain—you have to go Z to A, and then you're allowed to go to B. One solution to that is nesting JWTs. Imagine if we could do that with this: I want the override version of the database shim only if I've been through the override version of service 1 and the override version of service 2. So that's something I could start to work on.

(I've no idea—even if this were downloading, it should be quicker than this. I don't know what's going on. I'm probably not going to be able to show you this, sorry. Somebody did say that Docker Hub is rate-limiting KubeCon, because we're all on the same IP—is that what's going on? Yeah, maybe. Well, there you go. Honestly, I've loaded the... I've been here before, and I've got a couple of scripts, one that pulls all the images out of minikube and one that pushes them back in, so they should be there—that's why Istio came up so quickly. Oh well, I'm not going to take up your time with this. Honestly, it works.)

We might have got far enough: if we've got the Services, that's actually enough to generate the config. Let me just push all these in. Where is... yeah. http-log—that's my little thing, that's the thing that should be running. We haven't got any of those yet, but we've got—okay, we've got the Services.
So I think—yeah, if we were to make a call down that chain, we would get primary and beta versions at random: standard Kubernetes load balancing. I can't show you that. What I can show you is the generation—this is just going to apply them. So this went in, and it ran. My code has got a CLI mode and an operator mode. You can run it as an operator that watches the Services, the Pods, the Deployments, and continually emits the VirtualServices and the DestinationRules. Or there's a CLI mode, so you can try it out, see what it's going to do, integrate it into a GitOps pipeline, that kind of stuff—because everything I'm emitting is derivable from the other resources you've already got. So this was the code running, and we just piped it into kubectl apply. You can see it found a Service, and it found all the versions backing it. If I just run it, I'm running CLI mode locally: it's going to connect to the cluster and emit all of this stuff, and it looks as you'd expect it to look. There's also an operator mode that uses exactly the same logic and just does a watch, I promise. So that's all the demo we're getting—thanks, Docker Hub.

What else? Right, so there are a few caveats. This was kind of a proof of concept—I thought I'd talk about it, and it does work, but I got a bit busy, so it does need a bit more work. At the moment the thing basically needs a namespace to itself: it's going to look at the Services and emit VirtualServices and DestinationRules, and if you've got any other VirtualServices trying to do any more routing, they're just going to clash. The operator does use server-side apply, but obviously it doesn't own those resources, so it gets a little bit tricky.
I think there's a way around that: you can delegate from one VirtualService to another, so I might use an admission controller to patch anything I find to delegate to mine, and then do this thing. I need to think about it, but at the moment it kind of needs a namespace to itself. It is alpha—suggestions and PRs welcome.

So, some of the stuff I want to do. The JWT: I talked about that while we were waiting. GitOps: it's kind of transient state, these things that we emit, but you can run the CLI mode as part of some kind of generation pipeline that produces YAML for your GitOps repo. Again, it's going to clash with anything that exists, but I think I've got a couple of tricks I can use to fix that; I probably don't have time to talk about them.

But imagine if you could set up a route where—remember I talked about the different types of microservice—if one of them is a database shim, I might actually want to send GETs to the stable one, so I'm reading from the prod database, but POSTs—database writes—go to the v2 shim, the test shim, which maybe just black-holes them. Because it's a test order: I don't actually want it to go in the database, I don't want logistics to do anything. So if this thing's a database shim, I might say: hey, I've been through a test v2, so I actually want to go to a shim that's either going to send to a fake database or just drop writes, in memory or something. If we have this nice model of a logicless, stateless database-shim service, then we can start doing these things and it becomes a lot safer. And conditional routing is what I talked about just now. So, yeah, that's really what I wanted to show.
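Going back to that database-shim idea for a second: the read/write split could be sketched as a VirtualService method match. This is purely speculative—not something the operator emits today, and the names and header encoding are illustrative:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: db-shim-overrides
spec:
  hosts:
  - db-shim
  http:
  - match:
    - method:
        exact: POST           # writes from an opted-in test request...
      headers:
        x-override:
          exact: "db-shim=v2"
    route:
    - destination:
        host: db-shim
        subset: v2            # ...go to the shim that black-holes them
  - route:
    - destination:
        host: db-shim
        subset: v1            # everything else reads from prod
```

Within a single match clause the method and header conditions are ANDed, so ordinary GETs, and any request without the override header, still read from the production database.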
I wanted to talk about why we might want to do this, show that it's possible in Istio, show you how to do it, and then talk about the automation I've started writing, if anybody wants to go home and do this themselves.

The one last thing I wanted to show: as a service goes through its CI/CD lifecycle, it gets built, it gets linted, it gets unit tested, and then we get to "right, let's test it in prod"—without sending user traffic to it. The basic way of doing that is a staging environment or an ephemeral environment; I've hopefully persuaded you of the problems with those. I think overrides—what I just showed—is the sophisticated way of doing it. And that's really as much testing as we can do in isolation. Then we get on to releasing it, separate from deploying it: sending it user traffic. You almost certainly want to do a rollout—you don't want to send it all the user traffic at once. The basic way of doing that kind of release is to let Kubernetes do a rolling update, or to fiddle with DNS records at the edge; the sophisticated way is Flagger. So I think it's a combination: you deploy, your tester comes along, sets the header, gets to isolate things and test them, and then, when you're happy with that, you press the button and Flagger starts to do a rollout to real users. That's what I was going to show. I shouldn't have told you the code was broken, because then I could have just blamed Docker Hub—sorry I can't show you anything. But I think version 0.2 works, and I will fix it. I think that's the end of the slot. I'll be around if anybody wants any questions, but I won't keep you here. Thanks a lot.