let some people trickle in. Welcome, thank you for coming. My talk today is called Beyond Validates Presence Of. I'm gonna be talking about how you can ensure the validity of your data in a distributed system, where you need to support a variety of different views of your data that are all, in theory, valid for a period of time. My name is Amy Unger. I started programming as a librarian, and library records are these arcane, complex, painful records, but the good thing about them is that they don't often change. If a book changes its title, it's because it's reissued, and so a new record comes in. We don't deal with that much change and alteration within the book data; users, obviously, are a different matter. So when I was first developing Rails applications, I found ActiveRecord validations amazing. Every time I would implement a new model or start work on a new application, I would read through the Rails guide for ActiveRecord validations and find every single one that I could add. It was a beautiful thing, because I thought I could make sure that my data was always going to be valid. Well, fast forward through a good number of consulting projects, some work at Getty Images, and now I work at Heroku. And unfortunately, this story is not quite as simple. So I wanted to share today some lessons I've learned over the years. First, kind of speaking to my younger self: why would I let my data go wrong? That would be how the me of five years ago would react to this — what did you do to your data, and why? Next is prevention. Given that you may have accepted that your data may look different at different times, how can you prevent your data from going into a bad state if you don't have only one good state? And then finally, detection. You know, if your data's gonna go wrong, you'd better know when that's happening.
And if you were here for Betsy Haibel's talk just before me, a lot of this is gonna sound familiar, just a little bit more focused on the distributed side of things. So first let's talk about causes, and how your data can go wrong despite your best intentions. And I'd like to start by reframing that, by asking: why would you expect your data to be correct? Five-years-ago me would say: but look, I have all these tools for data correctness. I have database constraints, and I have ORM code — I have ActiveRecord validations. They're gonna be in my corner. They're gonna keep me safe. So let's take a quick look at what those would be. For database constraints and indexes, we're looking at ensuring that something is not null. For instance here, I'm trying to say that any product that we sell to you should probably have a billing record. For the health of our business, it's kind of important that we bill for things that we sell. And so this statement would keep us safe in the sense that, before we can actually save a record of something that we have sold to you, we also need to build up a billing record. The counterpart there would be an ActiveRecord validation — the inspiration for the title of this talk — where Product validates presence of billing record. Although after I submitted this talk, I realized that syntax is a little bit dated, so this is the form you may now recognize. So why would this go wrong? Well, first, let's say we get a product requirement that, gosh, it's taking too long for us to sell things. You've got so much work going on when a user clicks a button, and we really wanna speed up that time. So you think: gosh, I've already extracted some of my email mailers; I'm doing all the things I can in the background. But billing only needs to be right for us once a month, at midnight at the beginning of the month. Until then, we have a little bit of leeway. So why don't we move that into a background job?
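The slides aren't in the transcript, but the database-level guarantee described might look roughly like this — table and column names are my guesses, not Heroku's actual schema:

```sql
-- Hypothetical reconstruction: a product row can't be saved without a
-- billing record to point at.
ALTER TABLE products
  ALTER COLUMN billing_record_id SET NOT NULL,
  ADD CONSTRAINT fk_products_billing_record
    FOREIGN KEY (billing_record_id) REFERENCES billing_records (id);
```

And the model-side counterpart would be the classic `validates_presence_of :billing_record` — or, in the newer syntax, `validates :billing_record, presence: true`.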
Well, that leads to a kind of sad moment where we have to comment out this validates presence of billing record, because we want our product controller to have this particular create method. And what we're doing in that create method is we're taking in whatever the user gave us, we're saying, hey, all right, we now have a product that we have sold, and we're gonna enqueue this job to create the corresponding billing record. And then we're gonna immediately respond with that product. And that's awesome for them. They can start immediately using their Redis, their Postgres, their app, whatever they want. And it just leaves us with: within a few milliseconds, we need to get that billing record created. So it sounds great. Unfortunately, what happens if that billing creator job dies? You're in a tough spot, having a product that is not in fact billed for. Then we have another fun complication. Your engineering team thinks: gosh, it kind of sucks that we're doing all of our billing and invoicing in a really legacy Rails app. That does not seem like the right engineering decision. So let's pull out all of our billing and move it into something that can be scaled at a far better pace for that kind of application. Well, now our billing creator job just gets a little more complicated, because when it is initialized, it finds the product, builds up this data, and then calls out to this new billing service. And now we have two modes of failure: your job could just fail, or most of the job could succeed but your billing service could fail horribly. Which leads to our fun discussion of all the ways your network can fail you. Some of these are easier than others. You can't connect? Okay, well, you can probably try again. No harm, no foul. Let's just give it a shot. What happens if it succeeds partially on the downstream service? It doesn't fully complete, and you get back an error, and you think, gosh, I'll retry. Well, is it gonna immediately error?
Because it's like: I'm in a terrible state, I refuse to accept anything. Or is it going to say: that looks weird, maybe I'll create a new one? You have another option: the service completes the work, but the network cuts out in such a way that it thinks it's done, but you don't see that. So do you retry that, and risk the fact that maybe you're gonna bill for something twice? And then this final one is kind of a corollary to that: do you know which systems will roll back their work if they see a client-side timeout error? So with all of these aspects that are critical to designing highly performant, distributed systems, I think we have to move to accepting that your data won't always be correct. Or at least, it will be correct in a variety of different ways. It is perfectly fine now for a product to not have a billing record, because all that means is that the billing record is in the process of being created. What we want to be able to express is that eventually we truly expect something to coalesce to — most likely one, but maybe multiple — valid states that we expect it to spend the majority of its life in. Now, of course, that's not always true. People create products, or buy things, and then decide, whoops, that was exactly the wrong thing to buy right now, and immediately cancel. So you may not even get to see this thing finally coalesce into something that you might think would be valid. But what if you don't always know what correct is? So let's move to prevention, where it's more about handling those errors. We've stopped really caring about making sure that everything is in a perfect state; let's just handle the errors we're seeing in a sophisticated way. So we have a number of strategies. The first I'd like to talk about is retry. I mentioned this earlier: if you can't connect, might as well just try again. But this brings into question a couple of issues.
First, you wanna be aware of whether the downstream service supports idempotent actions. If it does, you're good — keep on retrying. Even if it already succeeded, keep on trying. It's fine. The next is a strategy where, if you're doing mostly just background jobs, you can implement some sort of sophisticated locking system. I haven't done that; it seems a little more work than I would want to do. But then again, if you are only doing jobs within one system, that might be the right solution. If you don't trust your downstream service to be idempotent, you get to choose between retrying your creates or your deletes. Please do not retry both — or have far more confidence than I do that your queuing system will always retrieve things in order. And the reason why you might think you don't have to choose is because, sure, if you put them on a queue, you can get first in, first out. But most of the time, with a downstream service, you're gonna wanna be retrying multiple times, right? Why retry just once? What if the service has a 15-minute blip? Should that require manual intervention? Probably not. You probably wanna say, hey, retry this thing like five or ten times. If it fails on the tenth time, that's fine, but at least try. Well, so what happens then if your delete call takes far longer to fail than your create? What that means is that by the second time around, your delete that is being retried is higher up in the queue than your create. And by the eleventh time, I mean, who knows which one is gonna come off first? And if you end up in the unlucky position that your delete call gets pulled off before your create, then you're left in a situation where someone just wanted to quickly buy something, realized they did something wrong, deleted it, and yet they're being billed for it ad infinitum — and nobody is happy.
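What idempotency buys you here can be sketched in plain Ruby. This is a toy stand-in, not any real billing service: a downstream that deduplicates create calls on an idempotency key makes blind retries safe.

```ruby
# Toy downstream service that deduplicates create calls on an
# idempotency key, so retrying the same request is harmless.
class FakeBillingService
  def initialize
    @records = {}
  end

  # Returns the existing record if this key was already processed,
  # otherwise creates one — a second call with the same key is a no-op.
  def create_billing_record(idempotency_key:, product_id:)
    @records[idempotency_key] ||= { product_id: product_id }
  end

  def record_count
    @records.size
  end
end

service = FakeBillingService.new
# A nervous client retries the same create three times:
3.times { service.create_billing_record(idempotency_key: "order-42", product_id: 7) }
puts service.record_count  # => 1, not 3
```

The key has to be chosen by the caller and held constant across retries — for example, derived from the product's ID — so every attempt is recognizably "the same" request.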
A final thing to mention with retries: if you are gonna do many, many retries, do consider implementing exponential backoff and circuit breakers. If your downstream service is already struggling, don't make things worse by increasing its load. Another strategy you have is rollback, which is a great option if only your code has seen the results of this action. So if your code base and your local database are the only ones that know that this user wants this product, absolutely roll back. But what about external systems? And the fun thing here is that you need to start considering your job queue as an external system, because once you say, hey, go create this billing record — even if the end result is that that billing record is going to be in the same local database — you can't just delete the product and have that record magically disappear. Roll forward, then: you have a number of options, right? You can enqueue a deletion job right after your creation job — even once you create something, you can delete it. You can also have cleanup scripts that run, that detect things that are in a corrupted state and clean them up, hopefully very quickly. But rolling forward is all about accepting that something has gone wrong, that that something existed for just a short period of time, and that we can't make it go away, because something out there knows about it. All right, so you say: okay, this kind of makes sense, maybe. What does this look like for my code? First, let's talk about transactions. Transactions allow you to create views of your database that are only local to you. So let's say I want to create an app, create a Postgres, create a Redis, I don't know, register like five users for that app, and also call like two downstream services. If you wrap all of that in a transaction, and any exception is thrown and bubbles up out of that transaction, all those records go away.
Now, your downstream services — you still need to worry about those. But it's a nice tool for making local things disappear. With that in mind, there are a couple of things you might want to consider. First is understanding what transaction strategy you're using. Usually this will be the ORM default. So if you were in Betsy's talk earlier, you saw ActiveRecord::Base.transaction do. That chooses, by default, one of four transaction isolation strategies. If you read Postgres's documentation, you'll see they choose a sophisticated default — but please understand which one you are using, because it has implications for what things outside of the transaction can see in, and what they can't. The next thing I'd like to suggest you consider is putting your job queue in your database. Now, if this causes you absolute horror because of the load that you foresee putting on your database, you are correct. And this is a little bit like LinkedIn back in the days when, rumor had it, they had like 20 people working on Kafka, and then they told people, everybody should use Kafka — Heroku has a decent number of very intelligent people working on Postgres. That being said, if this doesn't totally terrify you, you should absolutely do it, because what it means is you do not have to worry about pulling deletes off of the queue: they just disappear. So instead of having that crazy race condition of a delete possibly outrunning a create, it just never happens. You can write code as if the job were definitely going to run, but if you hit an error, it's as if that job never got enqueued. The next suggestion is to add timestamps, and I would suggest adding one timestamp to an object for every critical service call. So for a product that you sell, you might wanna consider adding a billing start time and a billing end time. And what you do is you set that field in the same transaction as your call to the downstream service.
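A minimal sketch of that timestamp-in-the-same-transaction pattern, with a plain-Ruby stand-in for the database (no Rails here; all names are illustrative):

```ruby
# Simulated transactional store: writes inside the block are discarded
# if an exception escapes, mimicking a database rollback.
class FakeStore
  attr_reader :rows

  def initialize
    @rows = {}
  end

  def transaction
    saved = @rows.dup
    yield
  rescue StandardError
    @rows = saved # roll back local writes, including the timestamp
    raise
  end
end

store = FakeStore.new
begin
  store.transaction do
    store.rows[:billing_started_at] = Time.now # set alongside the call
    raise "billing service timed out"          # simulated downstream failure
  end
rescue RuntimeError
  # swallowed for the demo
end

# The timestamp never stuck, so we know the call never succeeded,
# and retrying is safe.
puts store.rows.key?(:billing_started_at)  # => false
```

The point is the coupling: because the timestamp write and the service call live in one transaction, "timestamp absent" reliably means "call never succeeded."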
If the downstream service fails, it'll raise an error that you choose not to catch, which will exit the transaction and result in that timestamp not being set. Timestamps obviously give you some fun debugging knowledge, and they do help with issues you're debugging across distributed services. But the nice thing here is: if the timestamp's not set, you know that call never succeeded, and you should be able to retry if you know it is safe to do so. The next one I wanna talk about is code organization. And this is one where I don't have any panacea — it's really hard. But I want to advocate very strongly that you think about writing your failure code in the same place as you write your success code. What I mean by this is: if you have a downstream service — let's say you're calling Slack; in the next few slides I'm gonna talk about creating a new employee, so you're creating a new employee within your company Slack — in the same place that you are writing that create call, please, only a few lines away, have the code to do the unwind. So that no matter where your employee creation fails, whether it's further down the line from Slack or wherever, the path of the code goes right back through that. And what that helps do is it helps your developers think about failure paths at the same time as they're writing the success paths. So what would this look like? Let's say we're going to create an employee. And we have this beautiful app — this is a completely contrived example. We're gonna have a local database, we're gonna register them in Slack, we have an HR API, we're gonna upload a headshot to S3. Then we have another bunch of jobs — I don't know, maybe getting them all set up in GitHub. So what happens if, let's say, S3 is down? A lovely thing for me to be standing up here saying, right?
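Keeping the failure code right beside the success code, as described above, is exactly what the Saga pattern formalizes. Here's a plain-Ruby sketch — the service names and the orchestration API are contrived, not from any library:

```ruby
# Plain-Ruby sketch of the Saga pattern: each step pairs its action with
# its compensation, so failure code lives right next to success code,
# and completed steps unwind in reverse order on failure.
class Saga
  Step = Struct.new(:name, :action, :compensation)

  def initialize
    @steps = []
  end

  def step(name, action:, compensation:)
    @steps << Step.new(name, action, compensation)
    self
  end

  # Runs each action in order; on failure, runs the compensations for
  # every completed step, newest first.
  def run
    completed = []
    @steps.each do |s|
      s.action.call
      completed << s
    end
    :ok
  rescue StandardError
    completed.reverse_each { |s| s.compensation.call }
    :rolled_back
  end
end

log = []
result = Saga.new
  .step("slack",    action: -> { log << "slack user created" },
                    compensation: -> { log << "slack user removed" })
  .step("headshot", action: -> { raise "S3 is down" },
                    compensation: -> { log << "headshot deleted" })
  .run

puts result       # => rolled_back
puts log.inspect  # => ["slack user created", "slack user removed"]
```

Note that the failing step's own compensation never runs — it never completed — which is the behavior you want when a create may or may not have happened downstream.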
So if S3 is down — and I wrote this slide the day before S3 actually went down — then your employee creator class has a pretty clear path for unwinding this all, right? You call the downstream HR API, you pull the user from Slack, and then you cancel the transaction that would have created the employee. And that's lovely. You can think through that, right? But this is kind of more like what the code looks like in real life. And if this does not look like any code you've ever seen: congratulations, this is awesome, you should give a talk. You will get all the job applicants. So, do you know what to do to unwind this mess if it fails right there? I don't. I have absolutely no idea. And sure, I can stare at this long enough and try to figure out what's going on, and I'd probably get close. But if I'm tired, if I haven't spent time with the Slack API since they updated it, I'm probably gonna make a mistake. So something I'd like to suggest you consider is the Saga pattern, which allows you to create an orchestrator that essentially controls the path that things walk through, and keeps all of your rollback or roll-forward code encapsulated in the same spot as the creation code. All right. So with that in mind — that obviously this is hard and we're gonna mess up — how do we detect when things have gone wrong? The first thing I wanna talk about is SQL with timestamps. Since we have added, at some previous date, timestamps like deleted_at, created_at, billing_started_at, and billing_ended_at, we actually have some degree of hope of trying to reconcile things across a distributed system. Now, we may never get to perfect — we're definitely never gonna get to perfect — but with a bunch of different small SQL queries, we can maybe get close. So let's say we wanna tackle one small aspect of this. Shockingly, you all do not want to continue paying for things that you no longer have on Heroku. If you delete an app, we probably shouldn't continue billing for it.
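The query on the slide isn't in the transcript, but based on the description, a hypothetical reconstruction — table and column names are my guesses — might look like:

```sql
-- Hypothetical reconstruction: billing records that are still active,
-- attached to products that were deleted at least 15 minutes ago.
SELECT billing_records.id
FROM billing_records
JOIN products ON products.id = billing_records.product_id
WHERE billing_records.billing_ended_at IS NULL
  AND products.deleted_at IS NOT NULL
  AND products.deleted_at < now() - interval '15 minutes';
```

Any rows this returns are candidates for alerting: products the customer no longer has but is still being billed for.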
So this query may look a little bit complicated, but what it does is it says: hey, for our billing records and the things we have sold, find all the billing records that are still active that are attached to products that are not active — as in canceled, someone deleted them — where the product was deleted at least 15 minutes ago. And what that does is it gives us 15 minutes to become eventually consistent, into a state that we're pretty confident in. I say "pretty" not because we wanna continue charging you for stuff, but because, let's say the billing API goes down for longer than 15 minutes: this thing is gonna start yelling at me, and that's a pain for me, but most of the time — I mean, 15 minutes is a pretty darn long time — we're likely gonna be safe. So SQL with timestamps has a lot of benefits. Some of them are incredibly subjective. The first is absolutely subjective: I am far more confident in my ability to write business logic in really short SQL statements than in a very large auditing code base. That SQL statement, to me, is far more readable — something I can maintain confidence in, that it will continue to run successfully — than the same thing written in Ruby. That's probably gonna be something your team differs on, depending on where you work. The other nice thing about SQL with timestamps is that you can set these checks up to run automatically. Betsy was talking about Sidekiq earlier; we have just an app that will run these. We also have drag-and-drop folders to make it easy to write new ones. It shouldn't be hard for someone to think, wow, that record looks weird — let me write a check to see if there are any others like it. So these drag-and-drop folders will take SQL and they'll make sure it runs.
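A sketch of what automating such checks might look like — this is purely hypothetical, not Heroku's tooling: a tiny runner where each check is a SQL string plus an expectation about its row count. The "database" here is just a canned number, since this is a sketch rather than a real connection.

```ruby
# Hypothetical auditing-check runner: pair each SQL check with an
# expectation ("there should be zero of these") and collect the failures.
Check = Struct.new(:name, :sql, :expect_zero) do
  def failing?(row_count)
    expect_zero ? row_count > 0 : row_count == 0
  end
end

checks = [
  Check.new("orphaned billing records",
            "SELECT count(*) FROM billing_records ...", # elided
            true),
]

# Pretend running the query returned 3 offending rows:
alerts = checks.select { |c| c.failing?(3) }.map(&:name)
puts alerts.inspect  # => ["orphaned billing records"]
```

In a real setup, `failing?` would be fed the result of actually executing the SQL, and anything in `alerts` would page someone.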
Alerting by default — if you have ways of making this really easy and consistent; for us, that means wrapping our SQL in Ruby files that say, hey, alert me if there are zero of these, or alert me if there are any of these (the more common case). And then finally, documented remediation plans. As an engineer on call, I have really no interest in relearning our credit policy. I mean, I'm happy to do it, because it means my mistake gets cleared up, but let's not have to talk to our head of finance every time. He's not gonna be happy. Some of the challenges here, as you might suspect, are non-SQL stores. And I specifically say non-SQL, because you could be shoving structured JSON files into S3 — I don't know what you're doing. But yeah: NoSQL, non-SQL, who knows what. And everything I've talked about so far has been built on the concept of the big, beautiful reporting database. Every large organization I have worked at has one of these. Like, you have so many distributed services, and someone has just decided there will be a central one. I think it's probably a corollary of Conway's Law somehow. But in any case, what happens if, let's say, one of these stores is Redis? For us, we usually try to just do a quick ETL script and, if we need to, get it into Postgres. There's also the fully functional model of just flipping this on its head: if you don't want to use that big, beautiful reporting database, and you are fully confident that you can write good admin code, then you open the doors to so many other options. You can talk directly to Redis — get a direct Redis connection string — or hit some API that is backed by Redis. You can hit arbitrary APIs, and you can hit all of your other distributed systems. For me, the concern is that writing an application that will talk to every single one of your distributed systems seems a lot more bug-prone than just SQL off of one big massive giant database. But I've done this.
So it really depends on the scenario. As I mentioned, some of the challenges are non-SQL data stores, where, you know, you can call it pull, transform, and cache — those are usually the verbs we're using — but it's really just ETL. You can end up writing these checks in code rather than SQL, which may be the right choice. The other challenge that we're running into is systems that do not have timestamps, so you can't do anything that says, like, gosh, I expect for five minutes this thing to be in flux, but if it's been created for five minutes, absolutely start checking it. If you can't get timestamps added, then I would move to a strategy close to snapshotting: analyze the whole gosh-darn thing, and write records that say, hey, at this time this thing was correctly configured; at this time, this thing wasn't correctly configured, but hey, maybe next time it will be. And then we threw together some SQL to determine whether things are coalescing. You may wanna, again, do this in code — the SQL was about 60 lines long and included a self-join on a table, and it's a little scary. The other option, in addition to SQL with timestamps, that I wanna talk about is using event streams. And this may sound somewhat similar to log analysis, which it absolutely is, so if you're doing that, this will be very familiar. So let's walk through the events of buying a thing on Heroku. Each time we hit one of these events, Heroku will emit an event to a central Kafka, and we can read all of these events from one consumer. So for buying a product, we'll first see an event that says, hey, someone really wants a Redis. That's cool. We then move into assorted events on: okay, are they authenticated? Hey, is that product available? Are they allowed to install it? And this goes on — many, many, many events are emitted even for the smallest request — until we get to the end, which looks roughly like: hey, this Redis cluster is up and available.
Billing has started; the user response has been generated, either to send them a webhook to say, hey, it's available, or because they were waiting in line for us to do all this work. And you can start to see patterns. If the user is an average, authorized user, we can create that list of what events we should see and in what order. And we can use this to determine whether something was actually successfully created, and whether we should expect the data to be in the correct form at the end. So, some benefits of event streams. It's a single format. You're not having to negotiate: oh, that thing is backed by Redis; that thing — why are we still on flat files? Why? It is one place, and you can just register a new consumer to walk a stream, or walk many streams. It has the added benefit of essentially black-box testing your application. So again, this sounds similar to log analysis — where you're trying to determine whether your application is successful based off of, hey, if someone hits the search button, we should probably see some results returned, and we see that kind of structure in the log, and therefore we're gonna validate that this A/B deployment can slowly be scaled up. This is very similar, just used for a different purpose. I do have concerns about this approach, and we're not using this explicitly for any business-critical auditing right now. But it's something we've discussed heavily, and it's the direction we want to go in as we refactor things. So I wanted to show you some of the concerns I have with going down this road. What do you do if you emit the wrong events? Data on disk is something I have far more confidence in than whether we're continuing to emit the right event. I write typos. Anything that sounds similar, I'm probably gonna exchange. I've been known to exchange cache for cats — in my defense, there were cats on my lap. But you might have random errors like that.
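A sketch of that kind of consumer-side check — the event names are invented for illustration, not Heroku's actual schema: given a recorded stream, verify that the events we expect for a successful purchase appear in order, allowing other events to be interleaved.

```ruby
# Expected event sequence for a successful purchase (names invented).
EXPECTED_SEQUENCE = %i[
  product_requested
  user_authenticated
  product_available
  cluster_provisioned
  billing_started
].freeze

# True if every expected event appears in the stream, in order;
# unrelated events in between are ignored.
def purchase_completed?(events)
  idx = 0
  events.each do |event|
    idx += 1 if event == EXPECTED_SEQUENCE[idx]
    return true if idx == EXPECTED_SEQUENCE.length
  end
  false
end

stream = %i[product_requested user_authenticated product_available
            audit_logged cluster_provisioned billing_started]
puts purchase_completed?(stream)                          # => true
puts purchase_completed?(%i[product_requested billing_started])  # => false
```

A consumer running this over the stream can flag purchases whose event trail never coalesced into the expected shape — the event-stream analog of the SQL checks from earlier.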
What if you continue emitting events even when you're not actually doing the work? People make mistakes. And while it's one thing to scale up an A/B test and say, hey, this canary deployment is great, we're gonna go full out with it — it's one thing to rely on events and log analysis for that; it's another thing to trust the health of your business to the accuracy of your events. And then finally — and this gets back to: do you wanna be writing code that validates code? — what if the stream consumer code is wrong? What is your confidence level that your team is going to be able to write really good auditing code? So, this is the end of my talk, but I wanted to leave you with a caveat for what I have been proposing, especially towards the end, which is that everything I've been talking about is a lot of engineering effort. Especially building the beautiful, big reporting database if it's not there, or building an auditing system that will touch every single component of your distributed system. My time isn't cheap. And the reason my company has chosen to invest in some of these is because there are certain things that we just fundamentally cannot get wrong. We've talked a lot about billing because it's a pretty easy example — it's kind of visceral, us charging you for something that you should not be paying for. That's bad. But this also applies to security concerns, and for us, those are absolutely business-critical. That's why we're willing to put in this effort. But if you're building something that's a little more lightweight, and is not going to take down the business if you get it wrong, maybe consider a lighter-weight solution. In any case: I hope I've had something that was relevant for everyone in the room, whether that's talking about why your data might go wrong, how you might prevent it, or detecting when mistakes inevitably happen. I wanted to say thank you. I really appreciate you all sitting through this talk.
And I have about five minutes for questions.