I'm Betsy Haibel, and welcome to Data Integrity in Living Systems. Ordinarily I'd launch right into the talk content here. We've got a lot to cover, and that's what y'all are here to see. But Marko's keynote just now really hit home for me, and I wanted to follow up on it. Obviously, I'm a white woman and he's a black man, and those are really different experiences; I don't want to collapse the subtleties of that. But a lot of the ways he frames survival are the ways I do in my head. I often paint my experience of getting into tech from a non-traditional background as this super-simple, happy-path, theater-to-tech manic pixie nerd girl thing. I do that because it's always safe to paint yourself as an eccentric genius. Let's get real for a second. A decade ago, when I was getting into tech, I was just some chick with no college degree who had just washed out of an arts career because of a sudden-onset chronic pain disorder. Bootcamps were not even a thing yet, so I couldn't even get a certification that way. I was doing this all on my own, and that was terrifying. And I learned a lot of bad survival lessons then. This is relevant because there are some places in this talk where I describe times when I used to be kind of a self-righteous jerk. When you hear those parts, I want y'all to remember that I was only able to properly grow out of that after I wasn't the only woman in the room anymore, after I wasn't so alone. The only way I could move from surviving by being a jerk into nuance and kindness, and the actual success they bring, was to not have to carry the torch of being the only representative of my gender there.

Anyway, back to data integrity. Now, this talk used to have another name. It used to be called Data Corruption: Stop the Evil Tribbles. I didn't change it because that's kind of a hokey name for a talk. It is, but that reason would imply caring about my personal dignity. Let's look at this example of fine photo-manip art, why don't we? And don't worry, there are plenty more humorous Star Trek images ahead. You know they must be humorous, because that's what I'm calling them right now.

I changed the name of the talk because it imposed this frame of bad data as an evil invading force from outside, and literally everything else in my proposal was about how counterproductive that frame is. When we visualize bad data as a mess that clogs up our otherwise pristine database, we do two things. First, we pretend that there is such a thing as a pristine database, or a pristine anything, when we are computing. Let's be real: most of us work on codebases, architectures, et cetera, that look a lot more like this. Second, it trains us to think of the situation adversarially. When we think of this as the tribbles versus the crew, we're making a world of heroes and villains. We are making a world where heroic developers are fighting bad data. It is real easy from there to think of ourselves as the heroic developers fighting the sources of that bad data. That is benign when it is computers we are fighting, but this attitude quickly turns into thinking of our users as the enemy, or even worse, thinking of our teammates as the enemy. It is super easy to get self-righteous about data integrity issues, but once we start doing that, we lose our chance to solve the actual problem. When I think about the actual data integrity issues I've dealt with, product changes are the usual cause.
Or maybe miscommunication between teams. There are also cases where people don't use well-known data integrity patterns like transactions or Rails validations. I'll talk more about this later, but I'd like to dive in right now and say: no, this is not your responsibility in the way you think it is. If we're going blameless and looking at root causes, we notice that when people forget, or even consciously skip, these data integrity patterns, it's because the codebase they're working in is architected in a way that actively discourages their use.

Sometimes, yeah, you're going to encounter weird computer nonsense. There could be a bug in Postgres that emerges at 3 a.m. on the fifth full moon of a leap year, and somehow your data model is peculiarly vulnerable to it. In the interest of scope, I'm just not going to cover that. I'm going to focus on the 90% case, which is that your team culture and code structure are creating a situation in which bad data is likely. Especially because designing our systems to be resilient against common issues, like the product changing or an engineer making a mistake because they were toughing out a mild flu, also lets us proactively detect and correct an awful lot of the data integrity bugs that stem from the harder stuff, like concurrency under load.

So that we all have a concrete example to hang our understanding off of, I'm going to tell a story. It's loosely based on a job I had a few years ago at an e-commerce company. The details are not going to be exact; I would pretend that this is me changing names to protect the innocent, but it was a few years ago and, let's be real, I forgot. This company ran a Rails monolith that was old enough, big enough, and entangled enough to deserve that name. My team was working on a module inside it that took returned goods and shipped them back to their vendors for credit. It was not a tremendously complicated system in and of itself, even though the larger monolith was pretty complex. In fact, while neither the UI nor the underlying data model looked exactly like this, you can form a pretty good mental model by looking at this slide. You can go a long way by assuming that the only three things Return to Vendor cared about were the product, its vendor, and whether it had been shipped back or not.

Of course, no matter how few things your system cares about, the world will come up with a way to only give you partial data. In RTV's case, sometimes our incoming data stream would not contain vendor information. Spoiler: it is super hard to ship things back to their vendor if you do not know who that vendor is. In other words, the Return to Vendor module had unearthed a new product requirement. Now that we were returning things to their vendors, the upstream system needed to guarantee that vendors were recorded. We had an issue, but the issue was created by the RTV module's existence, not by any inherent problem with the upstream code. Because of a product change, data that had served the system's needs perfectly fine the day before was suddenly invalid. And this is pretty common. Data integrity issues that are caused at the product level are often best solved at the product level, too. We could have decided to tough it out; we could have, like, done machine learning or something to identify all of these missing vendors. But what we actually did for the feature was a small requirements change and a new piece of UI.
The small requirements change: just don't display units that are marked for RTV but don't have vendors in the main RTV display. The new piece of UI: if we don't know the vendor, or something else about the unit, it goes to a disambiguation interface, kind of the virtual equivalent of the desk that warehouse workers chuck things on so their supervisors can deal with them. Which is a lot more efficient than a more complicated, more technical solution.

The big takeaways we can get from this story: first off, scope data by whether it can move forward or not. Closely related, validating models in an absolute way is not going to be complex enough, or flexible enough, for complex business operations, so we need to be super careful about how we define our product at any given stage so that we can figure out what's relevant when. The last thing: don't overthink data correction. We do not usually need to do some complicated magic to derive information we suddenly need from information we already have. We can just go to people and say, "okay, tell me this new thing." In a living software system, users are as important as code, and humans are often much better at solving these kinds of issues than computers are. When we approach data integrity problems in a spirit of collaboration with our product team and our user base, intractable problems become tractable.

So that's how you solve product changes. Now, earlier I said that eventually we found the solution, which was scoping down the data and creating a disambiguation interface. There's a reason I said "eventually." There was perhaps some miscommunication on this project before we got to "eventually." Again, it's been a few years; I forget the exact course of events. But in general, my team assumed that upstream would be able to provide us with vendors. We assumed that when units didn't have their vendors marked, this was actually a bug, and furthermore a bug that we were suddenly obligated to correct. This was actually pretty arrogant. Sometimes upstream just doesn't have the data.

The data for Return to Vendor was mostly sourced from another module, Returns Receiving. Returns Receiving is a fancy warehouse-logistics term for the folks who log and sort through boxes of returned goods. And the folks in Receiving often just didn't know who the vendor was when they tagged a returned unit for processing. Let's visualize here, right? You're a warehouse worker, things are coming in off a truck, there's a big pile. You can hope that the packages are going to be nice packages that people actually put return labels on properly. 90% of the time that's true; most people are not jerks. Some people are jerks. Sometimes people don't label things, and you need to figure out what the merch is without that reference point. Sometimes the box has a brick in it, or nothing in it. There are all these permutations of ways that things get kind of weird out there, even when people are doing their jobs. And we're talking about warehouse workers here. Their job is not to do in-depth research on exactly what brand of brick just got returned. Their job is to log everything they can figure out quickly and move on to the next thing. They are measured pretty aggressively on how fast they can do this.
And because these are blue-collar workers in America, where the unemployment picture is not so great, there is a super huge correlation between their job security and how fast they can do this. So if we force a process on them where suddenly they need to think really hard and slow down, we are the ones being the jerks. Also: a brick in a box does not have a vendor. Sometimes your business process's desire for data is about as realistic as my ten-year-old self's desire for a pony.

Anyway, I didn't know any of this at the time. All my team was really aware of was the very local needs of the RTV module. And because those local needs were all we were paying attention to, we ran some migrations that screwed things up, maybe a lot, and it made it to production. Now, we could and should have not done that. We could and should have stuck some new validations in the module rather than running an intense, destructive set of migrations that unmarked a lot of things for return to vendor. And we didn't do that because when we added those validations on our own, CI failed. We assumed that CI's failure meant we had uncovered yet another new bug in this process (again, we were kind of arrogant) instead of assuming, as we should have, that CI failing was a sign that we had just broken things. And the root problem here isn't even that we were arrogant and ignored what CI was telling us. It is that, in the process of ignoring what CI was telling us, we went off and did a new thing instead of talking to Returns Receiving. Everything I just told you about how things actually worked in the warehouse, we discovered at a cross-team retro a few weeks later. If we had slowed down a bit, walked across the office, and said, "yo, John, what's up with this?" then we would never have been tempted to run the destructive migration that created a production error.

The snarky way of putting this is that all we needed to do was talk to each other. Or, as the Returns Receiving lead may have quoted during that cross-team retro later: individuals and interactions over processes and tools. This is true as far as it goes. But "just" is, and remains, the most evil word in software development. Whenever it pops up, you can be sure that lurking underneath, someone is radically discounting the effort something takes, or even worse, pretending to do so as a weapon to get what they want. Here's another quote; it's a bit condensed. The full quote, from Camille Fournier, the former CTO of Rent the Runway, is: "The amount of overhead that goes into managing coordination of people cannot be overstated. It is so great that the industry invented microservices because we'd rather invest engineering headcount and dollars into software orchestration than force disparate engineering teams to work together." And this is RailsConf, so there's some easy dogmatic joke I could make about microservices here, am I right? But let's get past that and listen to what she's saying underneath. What she's saying, in a lot of ways, is that if we elide the cross-team communication problem into "just talk to each other," then we are setting ourselves up for failure. The point of agile is not that processes and tools have no value. It's that focusing on individuals and interactions is likely to produce more value than declaring one specific set of processes and tools to be correct.
Similarly, observing the way the software actually behaves is always going to be more accurate than reading its documentation. Anyway, it is from the combination of these two facts that we build our mechanisms for making sure communication happens. I mentioned before that when my team added this validation, CI started screaming. If we had run git blame on the failing tests, some of them would have popped up with recent commits from folks on Returns Receiving. We could have gone, "yo, John, what's up?" But that idea didn't occur to us. We had decided we were the heroes.

Let's go a little bit deeper, though. Part of fixing interactions so we don't need heavyweight process isn't just building in lightweight cues like validations. It is being able to listen to those lightweight cues. This organization had a culture of pretty isolated teams, where everyone had goals they needed to pursue really aggressively because a certain person in upper management was breathing down our necks all the time. That created a situation in which we simply could not be receptive to folks from other teams initiating these kinds of conversations. When we build this accidental cultural norm that your own work is the most important thing, we are also saying that communicating cross-team is a waste of time. It is never a waste of time. So now we've got a rubric for approaching each individual, special-snowflake communication problem we encounter. There will be a lot of them.

Next up, let's move on to not using well-known data integrity patterns. Now, this photo, of course, is from one of my favorite Star Trek episodes, the little-known 1983 Christmas special. Anyway, people forgetting, or just not using, established data integrity patterns is one of the biggest opportunities we have to trap ourselves into moralizing when we talk about data integrity, because it's such an obvious example of human error. But when we actually want to limit the impact that human error has on our systems, blame and moralizing lead us in the exact wrong direction. We need to look at the ways the system actually encourages this human error and build in ways to recover from it. Note that I'm stressing recovery, not prevention. Prevention is "let's add more steps of horrible waterfall paralysis." Recovery is "let's acknowledge things and move on."

You're not intended to read all the code in this slide. You are merely intended to note that, oh boy, there's a lot of it. This is a symbolic representation of the way a lot of older Rails projects acquire validation sections that are maybe five times that big, or callback sections that are maybe thirty-five times that big. All of this is an attempt to build in more and more modeling to escape the data integrity issues that pop up in older Rails codebases. I could say that it's only the models without that enormous list that have a bunch of nil checks running around, but that would be a total lie, because we all know those models have nil checks running around all over the place too. We could make a lot of grousing jokes about this, but let's not; that leads us to blame culture. Instead, let's look at how and why this situation is actually terrible. It is paralyzing to developers, and it encourages them to get past that paralysis by skipping things. I have spent a lot of two-day streaks going, "oh, hell, what magic incantation will get FactoryGirl to work this time?"
I literally wrote a fixture concatenation gem to get past this problem, because it was so bad at one particular place I worked. Sometimes skipping validations is a totally rational choice, which is scary, but true. We need to get past this not by blaming people for having skipped validations, but by reworking our systems so they don't encourage it anymore.

And we also need to remember that this is just as paralyzing to users. Think about our poor warehouse workers from earlier, and then imagine users who aren't even paid to put up with it, and how your conversion rate might perhaps suffer a little. A lot. Requiring a lot of things up front encourages people to either give up or just enter some fake data. This actually happened very recently, and I work on banking software, so you'd think people wouldn't just enter fake data. Sorry, rant over. But people will do a lot of things to get past that annoying red validation box, and you don't want to give them that chance. You don't want to lead them into that trap. When we make folks deal with the full complexity of our systems all at once (oh hell, how am I halfway through my time already?), we are making our systems unusable for both users and developers.

So how do we make this a bit more real-world usable? Let's start thinking through the actual business processes involved and figuring out what's needed when. Here, we aren't doing everything at once; we're using a conditional validation to validate based on state. The validation doesn't fire unless a given fact about the model is true. We can also use Rails' custom validation contexts to do something similar. Here, we're specifying which context to use when we save. We can even go a little further from the designated Rails happy path and use ActiveModel-based service objects. The trade-off we're running through here is flexibility versus cognitive load, because the further we go from the Rails happy path, the more cognitive load it takes to safely invoke things. When you're dealing with an app where you can just sling everything into the model, it is super simple to save the model. Custom validation contexts mean remembering more things. Service objects mean remembering even more. Now, this seems subtle: you think people will just remember to look in the correct service-object directory and do this. They will not. They will remember the norms of default Rails. They will look in app/models, and they will screw up while using your pretty, non-Railsy service-object design. They will do this for entirely normal programmer reasons, like "this is a Rails app, things are in app/models." I am a big fan of non-Railsy service-object designs, but they do increase the cognitive load it takes to work with the system. We need to be real with ourselves about that and actually address the data integrity bugs that can result. You need to build in a culture of pairing, thorough code review, et cetera, to backfill this discoverability problem.

Another thing we can lose when we move away from Rails defaults is the way those defaults invisibly help us use database transactions correctly. Database transactions, for those who are less familiar, are a way of grouping database queries, one, two, three, four, so that all of them stand or fall at once. Say query three fails: query four doesn't fire, and the entire thing is rolled back as if it never happened. Rails callbacks give us this for free.
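Before we get to callbacks and transactions properly, here's a minimal sketch of those first two options, the conditional validation and the custom validation context. RtvUnit and its attributes are hypothetical stand-ins for illustration, not the real model from this story:

```ruby
class RtvUnit < ApplicationRecord
  belongs_to :vendor, optional: true

  # Conditional validation: only insist on a vendor once this unit is
  # actually trying to move forward in the return-to-vendor process.
  validates :vendor, presence: true, if: :ready_to_ship?

  # Custom validation context: this rule only fires when we save with
  # the :shipping context, not on every save.
  validates :tracking_number, presence: true, on: :shipping

  def ready_to_ship?
    status == "ready_to_ship"
  end
end

unit = RtvUnit.new(status: "received")
unit.save                      # everyday save; shipping-only rules stay quiet
unit.save(context: :shipping)  # shipping-specific validations fire here
```

The nice part about staying this close to the Rails defaults is that everything still lives in app/models, which is where people will actually look for it.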
Back to callbacks and transactions. I am as fond of callback hate as the next person, but Rails callbacks do give us that transactional grouping for free, and the hand-rolled version at the bottom of the slide doesn't. That version is super dangerous if a unit's save fails. If unit three of five doesn't save, we are going to have some internally inconsistent data: a shipment we've marked as shipped, but not all of its units will be. We can fix this by wrapping things in a transaction, but remembering to do that is much harder than you'd expect. So we also need a base class that wraps it in for you. We need to just do this for people instead of chanting "self-discipline, self-discipline" all the time, because no one has self-discipline when they have the flu. The reason Rails callbacks give us this for free is that I don't want to bother with this when I am working on a feature. I want to work on the feature. That is what computers are for.

These things also get a bit rougher when third-party services get involved, or when asynchronous jobs come in. And really, third-party services and asynchronous jobs are in many ways the same thing. Amy Unger is going to go into this a bit more in her eventual-consistency talk right after this one. In each case, there is code whose failures we cannot easily recover from with transactions. If a worker is queued up and the transaction fails, that worker cannot magically reach back in time into an already-completed transaction and go, "nope, sorry, lol." Similarly, if I make an external service call and my code hits an error later in that transaction, tough luck. There are ways to build distributed transaction systems. I advise you not to do this; it is very hard. And if you don't take this advice, you'll probably still want to implement the suggestions about catching mistakes that I'm about to give you.

Brief digression: this is a data integrity talk, therefore I'm contractually required to talk about the CAP theorem. The CAP theorem was known as Brewer's conjecture until Gilbert and Lynch proved it in 2002. It talks about what can and can't happen with data in systems involving more than one computer. CAP stands for consistency, availability, and partition tolerance. Consistency means updates are applied in the order they were received; you'll also hear this property referred to as linearizability. Availability means exactly what you think it does. And partition tolerance sounds fancy, but it really just means that the system's behavior is predictable if a server crashes or a network connection cuts out. You know, what computers do. The CAP theorem says that you can pick two, but you can't have all three. And really, what the CAP theorem says underneath is: consistency, availability, pick one. To quote Coda Hale's really great blog post, you can't sacrifice partition tolerance. Computers fail, everyone. If a server goes down, then you can either pick consistency, by refusing connections, or pick availability, by accepting potentially inconsistent data. You do not get any magic third option. As my coworker Makayla puts it, all systems exist on a continuum between safety and liveness. Guaranteeing data safety will reduce the liveness of your system. And similarly, you might want to prioritize liveness sometimes; you need to be able to accept that your data is not going to be perfect if you do. This is in many ways conceptually parallel to the idea that sometimes we can abandon complicated validation and callback systems to make our codebases more livable.
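Circling back to the transaction point for a second, here's roughly what "wrap the whole update in a transaction, and have a base class remember that for you" can look like. This is a sketch under assumed names; TransactionalOperation, MarkShipmentShipped, and the shipment attributes are all hypothetical:

```ruby
# The "base class that wraps it in for you": every operation inheriting from
# this gets all-or-nothing behavior without anyone having to remember to ask.
class TransactionalOperation
  def call
    ApplicationRecord.transaction { perform }
  end
end

class MarkShipmentShipped < TransactionalOperation
  def initialize(shipment)
    @shipment = shipment
  end

  private

  # Runs inside the transaction opened by the base class. If unit three of
  # five raises here, every earlier write rolls back instead of leaving a
  # shipment marked as shipped with half its units untouched.
  def perform
    @shipment.update!(status: "shipped")
    @shipment.units.each { |unit| unit.update!(shipped_at: Time.current) }
  end
end

MarkShipmentShipped.new(shipment).call
```

Note that this only protects the database writes; the external service calls and enqueued jobs from a minute ago still sit outside the transaction's reach, which is exactly why the safety-versus-liveness trade-off keeps coming back.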
Prioritizing liveness like this does come with an additional "we need to think about this harder, and sometimes we'll make mistakes" trade-off. And that is okay, as long as you're building systems that offset the negative consequences. In terms of those offsets, I have had a lot of success running lightweight audit processes: every five minutes, every ninety minutes, maybe something more complicated every week or even every quarter. They should do some basic consistency checks. Maybe they sum up all the entries in your accounting database and make sure that, yes, everything adds up correctly. Or maybe they make sure that every order marked as shipped actually has an associated shipment. And then you want to escalate the issues that crop up to real humans when discrepancies are found. Discrepancies will be found. It will be really noisy when you first turn it on. I have worked for a company, which I'm not going to name, where we turned this system off because it was finding a lot of discrepancies. You're going to worry that the audit system is the thing that is buggy, and not your data. Your data is the thing that is buggy; I am so sorry. But work through the list and it will get better, I promise. Slowly.

And if all of this seems a bit heavyweight, there's another thing you can do that's kind of spiritually similar. Stop nil-checking. Let your error tracker be this audit system. This sounds like a joke, but it is not a joke. The reason it is not a joke is that the reason we really write our nil checks is that we have data integrity bugs. Bugsnag says "undefined method whatever for nil:NilClass," and we go, "oh, damn it, shut up, Bugsnag," and we just slam a nil check through to deal with the proximate issue instead of investigating the root cause. So I'm suggesting maybe we stop doing this. Or, even if we do need to shut Bugsnag up, we start logging exactly how many times this particular piece of data is nil, so that we can determine how severe the bug is and prioritize doing something about it. Don't ignore system feedback. The reason these systems work is that they're fundamentally porting a DevOps practice, namely monitoring real-world performance and adjusting our systems accordingly, and applying it to data integrity. The other thing that's really powerful about them is that, again, we're re-involving humans. If we're escalating issues to support personnel, then support personnel can deal with these artisanal, one-off data issues in a way that's much more efficient than having a programmer trace through and figure out exactly where Postgres went wrong on the fifth full moon of the month. Again: in order to get data integrity efficiently, you need to recognize that software is a living system composed of code and people. That's how you get past situations where you can't or don't use well-known data integrity patterns like validations and transactions.

So next up is bizarre computer nonsense. This one goes by real fast. But no, seriously: the things I've talked about so far in this talk, having a strong product understanding, making sure you have good intra-team and cross-team communication patterns, building active practices around data integrity checking, and making sure to get humans involved when you have data integrity issues.
These are going to save you when things go really weird, because fundamentally, when there's an issue, whether it's a simple one, like something nilled out because a worker failed, or something I can't even guess at because computers can be weird, the important part is knowing that things are wrong and then fixing the thing that is wrong as quickly as possible. Preferably before the customer notices, or before the FEC notices, in the case of the fundraising company I used to work for. You can fix things if you know they're wrong, so it is not so bad.

You need to be aware of your product, its needs, and how those needs are likely to evolve, but that's called being a business-savvy application developer. You need to be aware of the communication patterns on your team and between teams, but that's called being collaborative. You need to be aware of how cognitive load and the things you don't know can lead you to make well-intentioned mistakes, which is called being a cautious engineer. And finally, you need to be aware and flexible when computers inevitably computer anyway. I've got nothing for that one, other than that my paycheck is real nice. This is all stuff you need to be conscious of, and when I put it that way it sounds easy, but it is not. There is no such thing as easy; there is no such thing as "just." The big overarching theme of this talk is that this stuff is hard and you are going to mess up. These are complicated problems, they require attention to detail, and that is hard to maintain over a long stretch of working fast. But if you build in mechanisms that let you recover from issues quickly, those issues might as well never have happened. You don't need to fall into analysis paralysis. You don't need to fall into "here are three million and five waterfall steps, plus an extra two," which will actually be counterproductive and send everything to the same place anyway. You don't need to resort to unhealthy finger-pointing. You don't need all of these struggling, doomed attempts to prevent the issue next time. You can just fix stuff and move on.

So, here's me. I'm Betsy Haibel, and I go by betsythemuffin on Twitter. There is very little programming there; there's a lot of cats, and also some feminism. This talk's slides and a rough transcript will be posted at betsyhaibel.com/talks at some point within the next week or so. I work at a company called Rustify, based in San Francisco. We are not currently hiring, but we'll be reopening hiring for senior developers soon, and if you want to help mentor some of the best junior and mid-level developers you could be privileged to work with, then I highly encourage you to apply. I also co-organize a group called Learn Ruby in DC, which is a casual office-hours type thing for newer programmers in the DC area. If you'd like to do something similar in your town, talk to me about it. It's not hard; you just need to show up.

And I've got a few minutes for questions. The question is: are there any tools that can help us check the data in our databases? This is going to sound like a facile answer, but: Sidekiq. Seriously, with whatever asynchronous or scheduled job runner you're already using anyway, you can build the audit systems I'm talking about by just having a thing that emails you or sends a Bugsnag if it finds a problem. Run it every five minutes and it'll be fine, or, if it's a really expensive calculation, maybe less often than that. Don't overthink it.
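As a sketch of how little "don't overthink it" can mean in practice: something shaped roughly like this, scheduled by whatever cron-style runner you already have (sidekiq-cron, clockwork, whenever, and so on). The worker, the models, and the Bugsnag call are hypothetical placeholders for your own equivalents:

```ruby
class ShippedOrderAuditWorker
  include Sidekiq::Worker

  def perform
    # Basic consistency check: every order marked as shipped should have an
    # associated shipment row. (Assumes Order has_one :shipment.)
    orphan_ids = Order.where(status: "shipped")
                      .left_joins(:shipment)
                      .where(shipments: { id: nil })
                      .pluck(:id)

    return if orphan_ids.empty?

    # Escalate to real humans instead of silently "fixing" the data.
    Bugsnag.notify("#{orphan_ids.count} shipped orders have no shipment " \
                   "(sample: #{orphan_ids.take(20).inspect})")
  end
end
```

If the query is too expensive to run every five minutes, run it less often; the point is that a human hears about discrepancies at all.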
The important part is that you build something quick and sustainable, and you can start incorporating more things over time.

Oh yeah, so the question was: when we have these transaction-script objects that live outside the Rails defaults to help us remember to wrap things in transactions, we run the risk of forgetting to use that object. Which is very real. And there isn't much that can help you with that other than a strong culture of code review. I've seen some people build more complex things; there's a talk Paul Gross gave a few years ago at RubyConf about monkey-patching Rails to get very angry at you if you step outside the approved path. I think that's more complicated than you want. But even having a cultural norm of "every single object in this folder must inherit from this base object" does a lot to help, because you've created an obvious place where things will jump out at you if they are wrong, so people are more likely to catch them in code review.

Yeah, and so the question was: a lot of the time, when we're trying to deal with particularly bad queries that Active Record might generate by default, we wind up resorting to raw SQL, which naturally bypasses Rails' callback mechanisms, validation mechanisms, et cetera, and therefore also bypasses the business rules those are maintaining. So how do you deal with that? Again, there is no perfect answer here, unfortunately. One thing you can consider doing, and this is RailsConf, so I'm going to get in trouble for recommending that we not be database-agnostic, is: don't be database-agnostic, folks. It is okay to use PostgreSQL's built-in check constraints, for example, and push more of these rules into the database. This is going to lead to some flexibility issues. Check constraints only work within a single table; for anything that spans tables, you'll need to start thinking about triggers. Having worked on a system that actually did need this to maintain data integrity, because we were pushing so much into SQL for scale (you can check out my former colleague Brawley Ocarina's talk for a little more about the other scaling issues we faced; plug, yay, it's later today), you need to be really careful. Rails doesn't have great feedback mechanisms for reporting the results of, say, a failing trigger back into Rails in a way that makes it easy to debug. So this is a really hardcore choice that you should make only if you really need it. A lot of the time, I would advise you to think really hard about whether you are best served by doing everything in SQL, or by extracting things to a service object that imposes these business rules, or whether you really do need to push things into the database. Sometimes you do, but it makes things a lot harder to debug. And it sounds, from that beeping, like I'm out of time. Thank you all.
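For reference, the kind of database-level check mentioned in that last answer might look roughly like this in a migration. The table, column, and constraint names here are hypothetical; the point is only that the rule keeps holding even for writes that bypass Rails entirely:

```ruby
class AddVendorCheckToRtvUnits < ActiveRecord::Migration[6.1]
  def up
    # Units flagged for return-to-vendor must have a vendor, even when a raw
    # SQL write path skips the application-level validations.
    execute <<~SQL
      ALTER TABLE rtv_units
        ADD CONSTRAINT rtv_units_vendor_required
        CHECK (NOT marked_for_rtv OR vendor_id IS NOT NULL);
    SQL
  end

  def down
    execute "ALTER TABLE rtv_units DROP CONSTRAINT rtv_units_vendor_required;"
  end
end
```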