Well, hey, everyone, my name is Chris Crowe, and I'm joined here by Stephen Atwell. Really happy to be here. Today I'm going to be talking, with a live demonstration, about adding data integrity tests to CD pipelines. Because I wasn't stressed out enough, I figured: why not add a live demo, connected through my cell phone, in front of everybody?

This really started as a thought experiment between Stephen and me, because at a previous company Stephen had a need for something like this, and I flub it every time I try to explain it.

Yeah, so at a prior company we were a BI startup, so our whole product was reporting and analytics. And the tests that found the most critical bugs wound up being tests where we basically stood up a copy of the customer on both the old and the new version and compared all the numbers in the reports. Because if they don't match, you've got a problem, and someone's going to be unhappy about it. Over time we wound up building out all this infrastructure that clones customer databases, stands them up, and does a giant comparison. Then Chris and I started talking, and I realized how easy it had become to do some of this cloning of production data. I told him about that, the next day he pulls up this demo, and the rest is history.

Yeah, I have too much time on my hands, much to my wife's chagrin most of the time.

All right. This is really an example of running data services right alongside the applications that are actually using them, a tightly coupled environment. I'm from the Seattle area, and I find that a lot of the time we tend to break those things apart, which has a lot of advantages but also some challenges. So let me run through what I'm going to show you. By the way, I think the slide deck took me longer than the actual engineering, because I'm that bad with anything even remotely related to PowerPoint.

Let me set the stage: this is a hosted CD pipeline, so all of my configuration lives in GitHub. I'm going to commit a change to this application; in this case, I'm actually going to upgrade the Mongo database. When I commit that change, instead of a simple "hey, let's deploy this to staging" or "let's deploy this to production," I'm going to have it jump through a few hoops. The idea is really to start a conversation about maybe a better way to be managing some of these tests.

So when I do this commit, instead of simply applying it to production, we start by scaling down the Mongo instance in staging and triggering a migration. There are a number of ways to do this that I won't get too deep into, but you want to make sure you're getting a copy of production over to the staging environment, because that's the value we're after here: being able to run the same sort of calculations on the database in staging as we do in production. We then deploy the new manifests into staging and scale the Mongo instance back up. At that point I can trigger an integrity test, in this case just through a GitHub Action.
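To make those hoops concrete, here is a rough sketch of what the staging-refresh step might look like if scripted with kubectl. The resource names ("mongodb", "mongo-migrate") are placeholders, not the demo's actual manifests, and the migration Job's contents are elided:

```python
# Sketch of the staging refresh: scale staging Mongo down, run a
# migration Job that copies production data over, wait for it to
# finish, then scale Mongo back up. Names are placeholders.
import subprocess

def kubectl(*args: str) -> None:
    subprocess.run(["kubectl", *args], check=True)

def refresh_staging_from_prod() -> None:
    # Take staging Mongo offline before overwriting its data volume.
    kubectl("scale", "statefulset/mongodb", "--replicas=0", "-n", "staging")

    # Kick off a one-shot migration Job (e.g. restoring a production
    # snapshot into the staging volume; implementation elided).
    kubectl("create", "job", "mongo-migrate",
            "--from=cronjob/mongo-migrate", "-n", "staging")

    # Block until the migration Job reports completion.
    kubectl("wait", "--for=condition=complete", "job/mongo-migrate",
            "-n", "staging", "--timeout=15m")

    # Bring Mongo back up on the freshly seeded data.
    kubectl("scale", "statefulset/mongodb", "--replicas=1", "-n", "staging")

if __name__ == "__main__":
    refresh_staging_from_prod()
```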
This is where I'm going to ask everybody to use a little bit of imagination, because the application I'm using is just a little fake barbecue ordering app backed by a Mongo database. So of course the integration tests I'm running are specific to that app: I'm making sure the orders actually make sense, that I can connect to the database, that the users all work, things like that. If the test fails, I stop the process there; if it passes, I move on to deploying into production. For this slide deck, I'm assuming it passes. Then I deploy the new manifests into production and run a canary deployment, checking Prometheus metrics as we scale up traffic to make sure everything is fine. Stephen actually had a great idea with this one: we're monitoring the warnings Mongo is emitting to see if they exceed what we'd consider a reasonable threshold. It doesn't really matter what the warnings are, but hey, if our installation starts throwing warnings, maybe it would be a good idea to roll back.

So, at a high level: we've put both the application and the database in the same CI/CD pipeline; we're testing in staging, using a copy of the production database, to make sure our data-specific tests all pass before we deploy the new version to prod; and then we're doing canary deployments in production.

All right, so people know there are no tricks up my sleeve, I'm going to go ahead and place a quick order here. Guess who doesn't like cornbread and fried pickles? Maybe? I don't know. I still want to know what the pork special is; I've been asking that question. I need to track down the guy who actually wrote this little demo app. So now I have something persistent in the database. I'm going to upgrade Mongo from 6.0.1 to 6.0.2. This is where I always love the dead air of live demos, because I can't really talk and do something at the same time. I'll upgrade this to 6.0.2, commit the changes to a new branch, and create my pull request. Once it merges, we kick off a quick build, grab the new Docker images, kick off that database migration, and deploy it all to staging; then we'll see if this new version of 6.0.2 works as well as we think it does. I'll unpack the back end after we've gone through this. Do we have Jeopardy music on the soundboard in the back? No? We need some. There we go.

Okay, so the new deployment starts. To spoil a lot of this for you: there are a lot of webhooks. There's a little open source utility we're talking to that's doing a lot of the gating, or rather, actually running these commands. It's what scaled down the Mongo instance in staging, it's what started the migration (you can imagine it running some kubectl commands), and it's what is waiting for this migration to actually finish, which we can see here, before it moves on to deploying the new manifests into the staging environment.

Yeah, the open source tool is called cmd-hook, and it's basically just a wrapper for running arbitrary command lines via webhook inside your cluster. So the migration is literally going and pushing all our production data from prod into staging.
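While that runs, here is a minimal sketch of what app-specific integrity checks like these can look like, assuming pymongo and a hypothetical "bbq" database with "orders" and "users" collections; the demo's real checks live in its own Flask app behind a GitHub Action:

```python
# Sketch of app-specific data integrity checks against the staging copy.
# Database, collection, and environment variable names are assumptions.
import os
import sys
from pymongo import MongoClient

def main() -> int:
    client = MongoClient(os.environ["STAGING_MONGO_URI"],
                         serverSelectionTimeoutMS=10_000)

    # Connectivity plus version check (the demo's artificial failure
    # gates on a particular version here).
    version = client.server_info()["version"]
    print(f"connected, mongod version {version}")

    db = client["bbq"]

    # Order count in staging should match what production reported when
    # the migration snapshot was taken (passed in by the pipeline).
    expected = int(os.environ["EXPECTED_ORDER_COUNT"])
    actual = db.orders.count_documents({})
    if actual != expected:
        print(f"order count mismatch: expected {expected}, got {actual}")
        return 1

    # Every order should reference a user that actually exists.
    for order in db.orders.find({}, {"user_id": 1}):
        if db.users.count_documents({"_id": order["user_id"]}) == 0:
            print(f"order {order['_id']} references a missing user")
            return 1

    print("integrity checks passed")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```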
It takes a couple of minutes to run, and then we'll be able to test that our database does indeed match. This started out actually running in my basement, and I was still using that setup for a while; for here I figured I'd put it inside a proper cloud vendor, so it should be a little quicker. The SLA on my garage is not as good as I would like it to be.

And for comparison, about five years ago, the first time I needed to do this, just this step was something like six months of engineering.

There we go, that one's finished, and we've actually started the deployment to staging. This is where we deploy that new version of Mongo, and it kicks off the integration test by way of a GitHub Action, which we'll see running here in just a moment.

So what are we testing with this new test that we're running, Chris?

Yeah, we've gone back and forth about the integration tests in this one. Originally I had a little Flask app connecting into the Mongo database to do exactly what I described, checking the number of orders and everything else. And then I would, hilariously, have it randomly fail. And I'll tell you one thing: you do not want random failures in live demos. So instead I'm looking at the Mongo version on the back end and failing it at a particular version for demonstration purposes, which you can see just happened. Because this failed, we stop the deployment right here, as you would expect. Of course, this whole thing is running inside staging, and since it was an artificial failure, the pipeline really is working. Oh, right, I have to log in; I've gone through so many different accounts because I keep forgetting the passwords for them. Now we're on to Stephen's. And there we go: I have a copy of my production data in the staging environment, so if something were actually wrong, I could spend some time looking at what happened.

Now, I came from an infrastructure background, so I'm going to do what infrastructure folks do best and plow ahead to 6.0.3. Are there any ex-infrastructure folks in the room I should have asked first? Yeah, see, I really need to change the order of my jokes here. Okay.

It's okay, most infrastructure folks have a sense of security.

All right, so let's upgrade this to 6.0.3 and merge it. And notice back here, by the way, that since we failed that step, we did not keep going. I'll wait for the next run to kick off. Now, once again, it's taking another copy of the production data.

Since we have a moment, I'll shortcut to one of the most common questions I get here, which is around data anonymization. Generally your best bet is to do that on the source side, maybe using a snapshot convention with your CSI driver, or however that works in your setup, because oftentimes staging is not in the same security zone as production. So keep that in mind: your best bet is to do those transforms on the source side of things. I'm going to work on integrating that. My last name is Crowe, so I think the standard joke is to replace it with Raven or something.

Yeah, one of these days we're planning to write the anonymization layer that replaces all occurrences of Crowe with Raven. Because why not? The other option some folks use is to basically run all of this in a production-like environment.
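As a sketch of what such a source-side transform might look like, run against the snapshot before it ever leaves the production security zone (collection and field names are made up, and yes, it includes the Crowe-to-Raven swap):

```python
# Sketch of a source-side anonymization pass over the snapshot.
# All names here are placeholders, not the demo's actual schema.
from pymongo import MongoClient

def anonymize(snapshot_uri: str) -> None:
    db = MongoClient(snapshot_uri)["bbq"]

    # Replace real surnames; a real pipeline would use deterministic
    # hashing or a faker library so cross-collection joins stay intact.
    db.users.update_many({"last_name": "Crowe"},
                         {"$set": {"last_name": "Raven"}})

    # Strip contact details that staging has no business holding.
    db.users.update_many({}, {"$unset": {"email": "", "phone": ""}})

if __name__ == "__main__":
    anonymize("mongodb://snapshot-host/bbq")
```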
If your staging environment meets the same security requirements as your prod, then you can usually get away with that. But a lot of the time, staging doesn't meet all your SOX controls. So yeah, do be aware of that.

Did that go? All right, this is especially speedy today, and with any luck we'll pass this time. About every fifth time I run this, the internet gods decide I'm not going to be able to pull down the Mongo image as fast as I'd like, and then we're tap dancing up here for a few minutes. Hopefully I didn't just jinx us.

Now, something important about this process: through all of this, while I upgraded staging to 6.0.2, what version of Mongo is production still running? 6.0.1, because nothing has passed the integration tests yet. Now that the test has succeeded, we finally get to deploy into production. As a quick aside, I think seeding this from the same system you use for DR, maybe backup restores, is a pretty good strategy. The more practice we get at ensuring our important data is easily restored, the better; if it's already part of a system like this, then hey, the best DR strategy is one we're testing constantly, right?

Running through our canary analysis, this is just Prometheus checks on the back end. We're checking Mongo for logging any errors, and then we're checking CPU usage as well, both through Prometheus, to make sure both of those are as expected. And since it went in fairly quickly here: I put a little "approve and continue" step in, just in case I didn't make it to the screen in time. That's just a manual step I added as part of this deployment process. Obviously, if any of those metrics had not been in the realm we were expecting, the system would have detected it, we would have automatically rolled back, and we'd be back on 6.0.1 and stable in production. Like that one, it looks like we had a check that may have failed. But too late now, it's in production. Yet another reason I love live demos.

All right, so what's happening under the covers? Let's take a quick look at some of those webhooks. From a deployment logic perspective, we have three deployment targets, and the way to read this YAML file is by looking at the dependsOn statements: staging depends on the migrate-prod-DB-to-staging step actually working, and production depends on staging properly deploying. You'll also notice the use of webhooks in here, and this is what I'm going to scroll down to, because these webhooks down here are what's doing the real work. For instance, when we're waiting for the database migration to happen, we're waiting for that migration job to reach a final state, and once it does, the webhook calls back into our CD tool to say, okay, you can keep moving. Or, if I can find it in here (I really should have reordered some of this), there we go, our integration tests: in that case we're kicking back out to GitHub to run a GitHub Action that does those checks I mentioned before. So this is really the piece that's doing a lot of the gating.
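That gating pattern, receive a webhook, run a command in the cluster, wait for a final state, then call back into the CD tool, might look roughly like the sketch below. This is the general pattern rather than cmd-hook's actual interface; the endpoint path and the callbackUrl/success payload fields are assumptions:

```python
# Sketch of webhook gating: the CD tool POSTs here, we wait on the
# migration Job, then report the result back so the pipeline can
# proceed (or stop). Paths and payload fields are assumptions.
import subprocess
import requests
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhook/migrate", methods=["POST"])
def migrate_gate():
    callback_url = request.get_json()["callbackUrl"]

    # kubectl blocks until the Job completes or the timeout expires.
    result = subprocess.run(
        ["kubectl", "wait", "--for=condition=complete",
         "job/mongo-migrate", "-n", "staging", "--timeout=15m"])

    # Tell the CD tool whether this gate passed.
    requests.post(callback_url, json={"success": result.returncode == 0})
    return "", 204

if __name__ == "__main__":
    app.run(port=8080)
```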
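And the canary checks themselves can be as simple as instant queries against the Prometheus HTTP API. In this sketch, the Prometheus URL, both metric names, and the thresholds are assumptions; real series names depend on which Mongo exporter and cAdvisor setup you run:

```python
# Sketch of canary-style checks via the Prometheus HTTP API.
# PROM_URL, metric names, and thresholds are all placeholder values.
import requests

PROM_URL = "http://prometheus:9090/api/v1/query"

def instant_query(promql: str) -> float:
    """Run an instant query and return the first sample's value."""
    resp = requests.get(PROM_URL, params={"query": promql})
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return float(results[0]["value"][1]) if results else 0.0

def canary_ok() -> bool:
    # Rate of Mongo-reported warning asserts over the last five minutes.
    warnings = instant_query(
        'rate(mongodb_asserts_total{type="warning"}[5m])')
    # CPU used by the Mongo pods over the same window.
    cpu = instant_query(
        'sum(rate(container_cpu_usage_seconds_total{pod=~"mongodb.*"}[5m]))')
    # Arbitrary demo thresholds: any sustained warnings or high CPU fails.
    return warnings < 0.1 and cpu < 0.8
```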
That ends up being a bit of a long file, because we have our Prometheus checks at the end. So that's really everything we have for this one.

Our hope with a lot of this is to spark a conversation: what more can we do, if we start putting data on Kubernetes, to treat our data services the same way we treat the applications we're constantly testing and deploying? Instead of sending a database upgrade through a different loop, with manual steps or extra processes that ultimately end up being more delicate. In a lot of the applications I work with, I'm a big fan of having these things tightly coupled: you can imagine this application's data services running in the same namespace, so the whole thing is nice and portable, which is why I can get away with doing this. It's not going to be appropriate for every application, but our hope was to put a bit of an idea out there. Our next step is to try something like a blue-green deployment that also involves data services; I think figuring out how to get those to cut over properly would be a lot of fun.

So I guess with that, and we're even ahead of schedule: any questions, comments, death threats, tomatoes? We did confiscate the tomatoes, right? Okay. I hope so. Last time it got a little messy. No one's reaching for anything. Hi, thank you.

So what's the size of this database? For example, I have an application where the database is gigabytes and gigabytes. I don't know how, for every change I make, I'm going to be replicating all that data.

Yeah, and that's the other common question I get, because in this case the database is tiny. Really tiny. So I can take a complete copy of it every time. Generally, what you're going to want to do, and I'd think of this in the same way you'd think about a DR setup, is something like incremental replication, some sort of incremental change capture. Constantly have the production version of your data syncing with the staging version; it should be able to sync to any other DR location as well. Then, when you run these integration tests, and I've experimented with doing this, what you actually do is break that replication temporarily while the test runs, so it can happen immediately instead of waiting for a full migration, because if we put too long a tail into the CD pipeline, it's not going to be helpful. So break replication at that point. The other important thing you need is some way to resynchronize afterwards, so you're not sending the complete terabyte database over the wire again to repopulate your staging environment. The two things I'd look for are being able to break and promote the replica side, and being able to replay any changes that happened while the integration test ran (there's a sketch of that pattern below).

A couple of things I'll add there. At the company that inspired this, we had terabyte databases, and everything here was full CD triggered from a Git commit. If you've got terabyte databases, you're not going to want to do this off every commit.
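As a rough sketch of that break-and-replay idea: MongoDB change streams can tail production writes into staging and hand back a resume token at the pause point, so only the delta has to be replayed after the test. Hosts, database, and collection names below are placeholders, and error handling and token persistence are elided:

```python
# Sketch: replicate production inserts into staging via a change
# stream, pausing on demand and returning a resume token so the
# backlog can be replayed after the integration test. Names and
# connection strings are placeholders.
import threading
from pymongo import MongoClient

prod = MongoClient("mongodb://prod-host")["bbq"]
staging = MongoClient("mongodb://staging-host")["bbq"]

def replicate(stop_event: threading.Event, resume_token=None):
    """Apply production inserts to staging until stop_event is set;
    returns the resume token to continue from later."""
    with prod.orders.watch(resume_after=resume_token) as stream:
        while not stop_event.is_set():
            change = stream.try_next()  # None if nothing new yet
            if change is None:
                continue
            if change["operationType"] == "insert":
                staging.orders.insert_one(change["fullDocument"])
        # Replication is now "broken": run the integration test against
        # staging, then call replicate() again with this token to
        # replay everything that happened in the meantime.
        return stream.resume_token
```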
So at that point, I would recommend taking a few of your smaller databases and setting up a nightly job, and then, before you do a larger release, you can run the full set of databases and make sure they all look good. As you start needing database tests at terabyte scale, you're probably going to be looking at a slower deployment cycle as well, and there are tools for handling slower-running tests like that. If you need data in your test suites and you've got relatively small databases, you can do this after every commit and have it still be reasonably fast.

Yeah, no problem. Any other questions?

You mentioned a little bit about synthetic data. A lot of tools, like Tonic, will let you take, say, terabytes of data and only use 10% of it, right? So for these massive datasets, you can just trim them down. And on top of that, on database startup time: you can also cache the startup of the database, do it as a nightly job, so you have it staged when you don't need it, and then when you're actually deploying, it's super quick.

Yeah. On that, I think trimming data down can, depending on the application, be a lot harder than it sounds. Basically, the instant you have different tables that need to join together, if you're shrinking both of them, you have to be sure all the joins still work properly. We experimented with that at my last company, and there was too much custom logic in there for us to get away with it.

Yeah, fair. We had that issue as well: once a month it would break, because the data sets, the links, everything would break.

Yeah. And, you know, depending on the schema, if you can get away with it, I agree it's a great way to make things faster. It probably would have broken this demo's test, though, because one of the things I'm checking with that Flask app is that the number of orders is the same before and after. If I start trimming those, the integration test flags it as an issue, because, well, it noticed a problem: it's not within tolerance. But like I mentioned, the integrity tests you run are very specific to what you're trying to accomplish.

Right, and that's why I tend not to show a specific one up here, because it's not one-size-fits-all. It's very, yeah, very specific.

Any other questions? All right. Well, actually, I'm not around this week, because hilariously I booked a vacation right over the top of KubeCon back when I wasn't sure this talk would be accepted. So I'm flying out tomorrow morning, but Stephen will be here for the entire KubeCon and can answer any questions. I figured I'd start the Hawaiian shirts early.

Yeah, if you think of a question later on, come find me, I'll be at the Armory booth. And on the data side, his whole company is here, and they're happy to answer questions on data.

All right, thanks, folks.