Welcome to my talk. I'm Jeremy, and we're going to talk about systems design, something I don't think we do enough of. Cool. The URL to my slides is apparently up there. All right.

For the past three years, I've worked at Kenna Security. Our application has changed a lot in that time. It's had to evolve and become more distributed as our organization has grown and as we've processed a larger volume of data. We've learned a lot from migrating and refactoring our system, but it's hard to find resources on how other companies have gone through that. I'm leading one of these efforts to pull apart some pieces of our application, and it goes a bit like this. Not always, but sometimes. And I'm constantly afraid we're building separated but interconnected towers of responsibility that we think are encapsulated, but just aren't.

A couple of years back, I read this book, and it constantly amazes me how much it translates to these kinds of problems and many others. But all the examples in it are about code: how to rearrange code and how to make code resilient to change. Which is great, but I want to know how that relates to the bigger picture.

So let's start off with some code. We have this library application, and this is a cataloger. I'm glad the syntax highlighting is visible up here; I was concerned. What it's doing is taking a list of books from the internet and creating a record of them in our system. It's a perfectly good thing to get working, but there's a lot going on here, a lot of behavior packed into it. If we want to improve it, identifying the roles and responsibilities isn't too difficult, because there are great tools and advice out there to help with this. From the community, Sandi Metz and Katrina Owen have given a lot of talks and made this common knowledge, very relatable and digestible. We've got the squint test, for example, where we look at the shape and color of the code to spot where something is reaching into another class and knowing too much about how it behaves. We've got an API client with some configuration mixed into it, a parser that probably shouldn't be there, and a lot of validation happening here as well. These can all be pulled out and separated into small classes, and that's a much better design that encapsulates each of these roles.

But what about this? We've taken that and put it into a worker, just a background job DSL. And it's not so obvious anymore, because we've abstracted away a lot of the configuration, how we're enqueuing work, and how it all relates together. It's not so easy to figure out where I can apply the design patterns and tools I've learned, and that feels pretty bad. And then, in the bigger picture, where do I even begin translating that design knowledge? If my system has a worker that, well, I forgot to add an arrow. If one of these workers is constantly having load issues because it's running out of disk space or just getting constantly overwhelmed, how can I go about improving that? Where do I even start thinking about how to refactor it?

My first instinct when I run into these kinds of problems is to ask my good friend Tony. That's really his cat, and he, and probably his cat, can attest to the number of stupid questions I've asked over the years. That's kind of the point.
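The cataloger from the slide isn't reproduced in this transcript, but the refactored shape being described would look roughly like this sketch. All of the class names and details here are invented for illustration; only the roles (HTTP access, parsing, validation) come from the talk.

```ruby
require "json"
require "net/http"

# Each role the squint test exposed gets its own small class,
# and the cataloger just composes them.
class BookApiClient
  def initialize(base_url:)
    @base_url = base_url
  end

  def fetch_books
    # All HTTP configuration and access is isolated here.
    Net::HTTP.get(URI("#{@base_url}/books.json"))
  end
end

class BookParser
  def parse(raw_json)
    JSON.parse(raw_json).map { |attrs| attrs.slice("title", "author", "isbn") }
  end
end

class BookValidator
  def valid?(attrs)
    !attrs["title"].to_s.empty? && attrs["isbn"].to_s.match?(/\A\d{10,13}\z/)
  end
end

class Cataloger
  def initialize(client:, parser: BookParser.new, validator: BookValidator.new)
    @client, @parser, @validator = client, parser, validator
  end

  def call
    @parser.parse(@client.fetch_books)
           .select { |attrs| @validator.valid?(attrs) }
           .each { |attrs| Book.create!(attrs) } # Book: an assumed ActiveRecord model
  end
end
```

Wiring it up is then just `Cataloger.new(client: BookApiClient.new(base_url: "https://example.com")).call`, and each collaborator can be changed or tested on its own.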
We've built up a vocabulary to talk about this. We've spent enough time talking about testing, mocks, dependency injection, and interfaces that we can tell a lot of good patterns from bad ones, and we can communicate about them. So why is there this discrepancy when we're talking about our system, about our application? Suddenly, once we move out of the code, we throw all our design tools out the window. But is this really so different a problem? We're still pretty much passing messages. Very similar. And that's a question I've been obsessing over for the past year or so, and of course the answer was on page three of POODR. Design is about managing dependencies, within your application and among the people working on it. It's about being flexible and adaptable to change, and about having small, simple pieces that compose together.

So if I wanted to go change the data store that my reporting service is querying, how easy is that to do? It's really not. It's difficult because these data stores are traditionally interspersed everywhere, and the person who put one in probably isn't even working at my company anymore. If I'm at a startup, this was probably designed on a whiteboard and then immediately erased and lost forever. So as a new hire, I don't have any breadcrumbs to follow. And organizationally, I don't think we do well with this.

So I made up a couple of anti-patterns, because I guess that's what I'm supposed to do. They're mostly around communication. Talk about these things. If you're new, ask questions. Keep thinking about these topics, and if you're established, be open to change. Even if you don't think you have this problem, discussing these things and sharing knowledge is important. And then maybe document it. Maybe.

More on that: don't have just a few brains architect your system. We're in this together, and there are many ways to reach a workable solution. The social problems an organization creates by having only a few people make these decisions are numerous and completely crushing to getting work done. Newer folks should be given the opportunity to learn, and possibly to fail, in the same way more established engineers can, because that's just how we learn. Not giving them that chance means you're going to have engineers who hesitate to make changes because they don't feel they know a certain part of the system.

And be aware of the trade-offs. Operationally and monetarily, talk through the costs and benefits of your decisions. Because we're not perfect, it's easy to make assumptions about the network. These aren't operations or systems problems; these are things we need to be conscious of when we're writing our code. When you're writing the code, it's there in your face. You're able to read it. But these concerns are actively abstracted away from your attention, and they have real consequences. We haven't talked much about what's going on here. This call relies on a ton of non-deterministic things about the network and about the response we're getting back from the service we're talking to. But what can I actually change about this? Not much. It's an argument list passed to a method. There's not much we can do, from the code's perspective, to make this more fault tolerant, more resilient to change, or more flexible. We simply need to expand our toolkit past the code.
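The call on the slide isn't shown here, so as a hypothetical stand-in with made-up names, it's something in the spirit of this enqueue, where everything interesting about the failure modes lives outside the argument list:

```ruby
require "sidekiq"

# Hypothetical background job for the cataloger example.
class CatalogImportWorker
  include Sidekiq::Worker

  def perform(feed_url, library_id)
    # fetch the feed, parse it, validate it, create records...
  end
end

# From the code's point of view, this is just an argument list passed to a
# method. It quietly assumes that Redis is reachable, that the arguments
# serialize cleanly, that a worker process is alive to pick the job up, and
# that the external feed will still respond whenever the job actually runs.
# None of that is visible, or fixable, on this line.
CatalogImportWorker.perform_async("https://example.com/catalog.json", 42)
```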
So these are the two best resources I know of on this. The one on the left, by van Steen and Tanenbaum, is the academic end of distributed systems. Not gonna lie, it's really dry, but it's packed with a ton of detail. It's mainly focused on making a single service that provides a single function, something like Elasticsearch, highly available and resilient to faults across many nodes and across a network. The one on the right is by Martin Kleppmann, and it really gets into the weeds about the inner workings and trade-offs of a lot of the modern tools available to us from cloud providers. There's a lot of information in it.

The main takeaways from the first book, at least to me, are these. We want to hide communication latency by being as asynchronous as possible. You might not realize how much Elasticsearch nodes talk to each other just to health check whether they're available, but it's a lot. We want to load balance effectively with queues; performance monitoring and queueing theory come into play a lot here. And don't be afraid to replicate data. Cache it locally to the client process. A bit more on that one.

Whenever we call out over the network, it's slow, it's expensive, and we're exposed to a lot of failures and faults in the network. Instead, we can store the result locally in a variable; that's a form of caching. We can create a record of it in our application, which is probably the most common approach, and also a form of caching. Or we can put it in a key-value store like Redis as a familiar read-through cache. What's not obvious about all of these is when the caches expire: the invalidation question. A variable lasts for the entire process duration, so if you have a ten-hour Resque job running, you're probably operating on stale data. A database entry could exist forever; it could be years since it was updated or referenced, and it's probably different from what's in the external system you imported it from. Only one of these is predictable, and that's the read-through cache, because we're explicitly telling it when it expires and when we have to rebuild it. That's the consistency cost. It's one of the primary trade-offs we can leverage, design around, and optimize for: determining our constraints and the SLAs we need to meet on how up to date, how real time, how consistent our data needs to be. Sorry, that's a pretty dense slide.

Kleppmann's book has a lot of content in it. Some of the topics it goes into: encoding messages and protocols between services, RPC protocols like gRPC, encodings like JSON, that kind of thing. It has a lot of information on this; we're not going to talk about it. How to partition your databases: it has a lot of suggestions on algorithms to use, and I'd say the easiest one is to find a cardinal value, something that's unique to a set of data, and partition on that. It's okay to use data stores for specific purposes; in fact, it's encouraged. Specializing some of them is more performant than trying to turn MySQL into a Swiss Army knife of aggregation, just as you'd batch your IO work into threadable workloads. And then, if something has to change, or something has changed, you can stream those changes out to whatever needs them.

I wanted to spend a bit more time on events, because there's a lot of misconception that stream processing is the same thing as event sourcing, and that's just false. There are two ways to transfer information: you can transfer it by function, as in this nice arithmetic example, or by result. If you're going to put these into a log or a queue, they have different properties, which might be useful or detrimental to you, but you should be aware of them.
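The arithmetic from the slide isn't reproduced here, but a minimal sketch of the two styles, with invented values, might look like this:

```ruby
# Transfer by function: a log of operations. Addition is associative and
# commutative, so these can be applied in any order, even in parallel,
# as long as every one of them is eventually processed.
operations = [[:add, 5], [:add, 3], [:add, -2]]
total = operations.sum { |_op, amount| amount } # => 6

# Transfer by result: a log of computed results. Only the last write matters,
# so we can skip straight to it, but now order is everything.
results = [{ total: 5 }, { total: 8 }, { total: 6 }]
final = results.last # => { total: 6 }
```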
The result-based one is order dependent, which makes it very difficult to distribute the work, because different computers have different clocks. The time synchronization problem; academics love that one. But it has the really awesome property that you only have to ensure the last write wins. If you can do that, you can ignore all the rest of the operations. You can just say, I've already written the final result of this, and process the queue last in, first out. It's awesome. Likewise, on the left we have a very cool associative property where we don't need to maintain sequential integrity; we just need to make sure everything gets processed. So we can distribute many different jobs in parallel. They just all have to finish. But if we want to recreate the state, we have to process all of them again. It's a long way of saying that these are design patterns. They're choices to make, not a silver bullet. Make meaningful choices.

And finally, in my opinion, the biggest takeaway from this is determining the systems of record and the derived data in your system, poorly represented by me here as functions. As you can see, function f relies on the output of g, which relies on the output of h, and so on. We have to process all of them in order to produce the final one, and they have to be processed in order. So if I make a change to how function h behaves, I have to reprocess all of the other sequential work. Alternatively, if all the functions pull from a single source of truth, a single system of record, we can make changes and they're not dependent on each other. It's basically the difference between a Law of Demeter violation and a composable object. This is a code smell and a place for optimizing your system, so look out for it. And, full circle, by thinking about our system in these terms, we can tease apart some dependencies and design for what they actually depend on, which is the data.

Some rules of thumb. Use the actor model: communicate through message queues as much as possible, and make your work asynchronous. Make workers retriable and have the work they do be idempotent; there's a reason this is in the twelve-factor app, and it's very strong. Precompute as much work as possible: if you know what you need to display to the user, do it out of band from when you have to display it. Identify where data is derived in your application; this gives you some nice seams to make asynchronous. Determine the consistency costs and the constraints in your system. Pay for what you need to be real time and relax the constraint wherever you can. That will allow you to keep the interactions between the services in your system simple.

And with that, I want to spend the rest of the time revisiting our library example, making some building blocks, and showing some of the evolution of our system at Kenna. So, building blocks. The first major piece here is the worker. This is how we process asynchronously. If we're transparent about failures and we make these things retriable, we can just kill the process when it faults, push the work back onto the queue, and let it reprocess. Workers pick up work from queues; pick a format for your messages, and try to optimize simply through the queueing behavior, which lets you avoid redundant work. Materialize your data into a view. Precompute what you can. Invalidate caches in a very predictable manner. Have artifacts built from expensive computations so you can circuit break and default to yesterday's value if today's job failed.
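A minimal sketch of what retriable, idempotent work can look like; the class, queue settings, and `GoodreadsClient` here are invented stand-ins, not our actual workers:

```ruby
require "sidekiq"

# Safe to kill and re-run: doing the work twice leaves the system in the
# same state as doing it once, so failed jobs can simply go back on the queue.
class ReviewRefreshWorker
  include Sidekiq::Worker
  sidekiq_options retry: 5 # on failure, the work is pushed back onto the queue

  def perform(book_id)
    book = Book.find(book_id)                             # assumed ActiveRecord model
    reviews = GoodreadsClient.new.reviews_for(book.isbn)  # hypothetical API client

    # Upsert instead of blind insert, so replaying this job never duplicates rows.
    reviews.each do |r|
      Review.find_or_initialize_by(book: book, external_id: r["id"])
            .update!(rating: r["rating"], body: r["body"])
    end
  end
end
```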
Building blocks like these give you very simple interfaces between the parts of your application. So let's look at some universal behaviors that I think most applications have: interfacing with an external API, maintaining relational models across the network, deriving data points through business rules, and performing computations on our data set. And finally, high read, low write; spoiler, that's the problem you want existing tooling for. Don't reinvent the wheel.

Say we want to display reviews from Goodreads or another external service. How can we source those into our application for the user? We can embed the call directly in the front end. This gives us as hard a consistency as possible, because we're literally displaying whatever is in the external service, but we're relying on the network for every render we make. If we move that into the back end instead, we can have a read-through cache, so we're making a single request. It's still a real-time request on render, but it's an improvement. And if we know what we'll need, we can simply use a worker and process this in the background. We can make the query asynchronous and retriable if it fails, figure out what we need to display to the user, and push that into a highly available data store, so rendering isn't actually hitting our server. However, it's hard to determine what has changed, because not all APIs give you a change set; "hey, here's everything that changed in the last day" doesn't happen. And we can't pull in the entire Goodreads data set; it's simply too much. So this isn't always applicable.

But what if we want to make updates back to the external service? What if the user wants to add a review? This becomes a pretty tricky problem, simply because we don't manage the thing we're keeping consistent with. One option would be to use a client consistency model and display the new content exclusively in the user's JavaScript session. The user thinks they've added a review to the Goodreads system, but we're processing that at our discretion, and we can batch those updates together if there's an API request limit. So there's a lot of good here, but it's obviously given us a bit of complexity. We don't handle authentication errors well, and we're technically lying to the user about what we've done in our system, which isn't always the best thing to do, but sometimes it's better than the alternative. Another approach would be to update our internal state and invalidate the cache with a priority queue. This is most often what we're doing at Kenna. It's worked very well, and it'll cover something like 90% of your use cases. We can still process things asynchronously and retry on failure. You get a lot of benefits working this way, but you're also making trade-offs, so figure out which ones you can tolerate.

So, relational models: how do we maintain relational records when the data is in separate places? Say we want to add a book and an author and display them to the user from a background job like our cataloger. How do we maintain transactions and joins across multiple tables? These get really complicated in service-oriented architectures. In the monolith, this is fairly straightforward: we can wrap each row insertion into a single transaction. It just works. It's straightforward, it's simple, it's very reliable, and you have guarantees of consistency from your data store, if you're using a relational database, I guess.
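In the monolith, that can be as small as this sketch, with models invented for illustration:

```ruby
# One transaction against one database: either the author and the book both
# exist afterwards, or neither does.
ActiveRecord::Base.transaction do
  author = Author.find_or_create_by!(name: "Ursula K. Le Guin")
  author.books.create!(title: "The Dispossessed")
end
```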
But a single relational database gives you a hard upper bound on the amount of data you can have in one place, and if you're approaching it, you will take significant performance hits, much like we did. So we partitioned our database by a cardinal value. Clients, in our case, because we process data per client, and this fit into how our role-based access scheme was working anyway. And we keep a lot of the good qualities of having a relational data store: transactions and joins are still easy to do. But it's very hard to retrofit. ActiveRecord doesn't handle multiple database connections for you; they're working on it in the new Rails, but it's not there yet. We can't table scan across multiple databases. And you're going to end up with hotspots, like a client that has a million vulnerabilities, in our case. So it's not perfect.

The first suggestion the internet will give you is to wrap everything in an API. Divide everything by foreign key and give it a RESTful interface. It's the simplest translation you can get from how we store information in a table. But joins are non-existent, transactions aren't guaranteed between two separate tables, and you're simply relying on many more services being available to display information to the user. I think there's a lot more bad than good there. So then the internet will tell you to keep transactions in lockstep with each other. You have an event stream, or event log, and then you update things transactionally, and when one side fails, you just roll everything back. In blog terminology, this is a saga. You get much better consistency; you can ensure that a lot more things are the way they should be. But we've coupled our distributed processing together, where one data store now relies on the other one to finish processing before it can begin, and we're still left without the ability to join between these two services. So it's still not great.

But what if we simply didn't care? You're not going to have perfect data in your system, so plan for it being bad. Choose what you're going to display. Invalidate the relations you don't have, and version the ones that you do. Whenever anything changes, have a worker serialize that change set out. Precompute what you're giving to the user or to another service. Materialize the joins and optimize subsequent batch calculations. Circuit break. The problem you get with this is that you have now given yourself something else to keep consistent with the rest of your services, and it's a bit more complex.

This is the evolution we've seen our system take as we've hit the different limits of the data stores we're using. Plan for the problem size that you're actually experiencing. Don't over-optimize. Use the simplest solution, but anticipate that it might have to change, might have to grow, and might have to become something a bit more complex. So determine where those parts exist in your application. And I guess I shouldn't speed up.

So how do we describe business rules in our system? Say we want to source new records from multiple social media platforms. The naive approach is find-or-create: a simple transaction-based record creation, and we can lock on records. That eliminates some race conditions, but there's an inherent race in processing this way, we lose information about the sources, and we're actively overwriting information.
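Sketched out with invented models, that naive approach looks something like this, and you can see exactly where the overwrite happens:

```ruby
# attrs came from one of several social media or review platforms.
def import_book(attrs)
  Book.transaction do
    # The row lock closes one race (two workers fighting over the same row)...
    book = Book.lock.find_or_initialize_by(isbn: attrs[:isbn])
    # ...but whichever source we happen to process last silently wins, and we
    # keep no record of which platform each attribute came from.
    book.update!(title: attrs[:title], description: attrs[:description])
  end
end
```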
CRDTs are cool. I'm not going to talk a lot about them; they're data structures that know how to resolve writes. Martin Kleppmann has a cool talk on this, and I recommend watching it. In reality, we'd probably want to save data in as close to the form we receive it from the source as possible. That gives us as clean a copy of the data as we can get, and we can normalize it into the shape our system wants to use. These don't have to be separate services; it's just drawn that way, and they usually aren't. They're really just separate models or tables. And it's easier to debug inconsistencies in the information you're getting from the external API when you don't have to go back to the external API. But you do have to maintain consistency with the source, and you have different abstraction levels for what these records mean. What are we really doing here, though? Modeled through an event log, this is really just a derived view over different records. We can push out different versions of our business-rule algorithm and compare their effectiveness. We can version these. And suddenly we have a very powerful tool for comparing the work we've done.

So let's talk about computations. Effectively, how do we query, how do we compute, how do we make sense of our data in a way that provides value to our customers? The first thing you should do is publish an artifact. If you are updating an attribute on a record, you're doing it wrong; this is solely derived data from an expensive computation. If you have a distinct artifact from it, you can circuit break and use the one that already exists if your current computation fails. You can also detect anomalies by running some sanity checks before pushing it live, and you can compare improvements and regressions between your computations very simply if they live in files. There's not much bad about doing this.

And the first improvement we can make over a table scan of our entire data set is to batch it up. Just find_in_batches. It's easy, and ActiveRecord makes this kind of thing very memory efficient. You're still IO bound here because of table size, you run into hotspots, all that. If we've partitioned our database, this rolls very nicely into it: we're less IO bound, but we still have the problem of hotspots and skew. The term, or the algorithm, for this kind of work is MapReduce. If we want to include more context, more data, in a job, we can just add a simple step that translates or joins data into our computations. My thinking on this is, instead of having a step in our processing join that data together, why don't we simply materialize a view and batch from that? We're reusing a seam in our system, and we're operating on exactly what we're displaying to the user, as long as we have it well versioned. Depending on the data store you're using for this, though, it might not be optimized for subset querying. So one thing we have done is use fact tables and change sets. We publish a CSV of the changes and send that up to an optimized data store like BigQuery. It's kind of hard to debug when the numbers don't add up, because you're effectively monitoring the entire state of your data set, so it has some complexities there.
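As a rough sketch of the batch-plus-artifact idea, with file paths, models, and column names invented:

```ruby
require "csv"
require "date"

# Compute a derived report in batches and publish it as a distinct artifact,
# instead of updating attributes on records in place.
def publish_vulnerability_report
  path = "reports/vuln-counts-#{Date.today}.csv"
  CSV.open(path, "w") do |csv|
    csv << %w[client_id vulnerability_count]
    # find_in_batches keeps memory flat even over a huge table.
    Client.find_in_batches(batch_size: 1_000) do |clients|
      clients.each { |client| csv << [client.id, client.vulnerabilities.count] }
    end
  end
  path
rescue StandardError
  # Circuit break: if today's computation fails, fall back to yesterday's
  # artifact rather than serving nothing.
  "reports/vuln-counts-#{Date.today - 1}.csv"
end
```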
Tooling, yeah. So this is pretty much what we've done in our system, and it actually looks fairly similar to this terrible drawing of mine. We have two primary job queues. The one on the left is the work we need to process and normalize into how we manage data, into our business platform, pretty much. The one on the right is what we're actively serializing into Elasticsearch and providing to our users. Effectively, it's two primary MapReduce flows: we're importing data and normalizing it together, reducing it into a normalized set of vulnerabilities in our case, and then we're fanning out, putting it into search indices and documents that are quickly available. As you can see, most of this is done completely asynchronously, because we're putting everything into queues, pulling from them with workers, and serializing the results out. The question here often becomes, how do we compose these data sets together, and at what points? If you want more information on that, with better numbers, my friend and colleague Molly Struve is giving a fantastic talk on this tomorrow called Cache Is King. I highly recommend it; it's awesome.

Because in the end, this is really about simplicity. It's how you can reason about your system, and it's how you can collaborate with others on it. That's all the content I've got. Thank you so much for listening. For more on this topic, these are some of my favorite resources. Thank you so much. Here's some contact information.