Distributed systems. So by a raise of hands, how many people here are currently working on some sort of distributed system, or considering working on one? Oh, definition: we're just using a pretty loose definition, that it's multiple applications working together. So hands back up. All right, that's a lot of us.

Okay, so this is kind of a unique thing. Coming from the shared background that we all have in this room, we're all very familiar with the huge benefit we get from Rails' conventions. We know it allows us to move very fast, because we can start with a sensible set of defaults and only deviate when it makes sense for a particular application. The company I'm working for right now, we're called MX, and we've been working on a distributed system like this for about the last three and a half years. There's a lot that we've learned along the way, and most of the things that have turned out to be really valuable, we've encoded into conventions on our team. They're things we talk about a lot. So basically we're gonna be doing a little bit of show and tell. But it's really hard for me to just show you a list of "here's the things you should do," because that wouldn't mean a lot to you, and it would be really hard for you to decide whether it's valuable for your team or your distributed system. So the way we're gonna talk about these three ideas is that we're gonna build a sample project together. Welcome to your first 30 minutes of being a junior engineer at MX. This is basically what it feels like for everyone when they come onto the team, including when I came onto the team. Your first job: whenever a user gets deleted, we wanna clean up all the data for that user throughout the system. So that's our sample project.
I'm gonna use as much real data and as many real examples as I can during this talk, hoping to give you an idea of what pain we felt as a team, so that this becomes transferable knowledge to your context. So this is the system. There are 18 Rails applications running together in production. Whoa. I sounded like I had a really deep voice for a minute there. I kinda liked it.

Along the bottom, in green, we have what we call the front end applications. These all deal with public HTTP interfaces, but none of them manage their own database; they all delegate that via RPC to these other services. The blue services we call core services; each one of these is managing a database of information. And then along the sides we have these periphery services. On the right are some services that help manage the way we connect to banks and credit unions to get information, and on the left are two services that are about aggregating and analyzing data in some sort of big-data-ish, data-scientist-y sort of way.

So where are you gonna get started as a junior engineer when we tell you, go clean up user data, here's 18 applications? And if there are 18 applications, that probably means there's a whole lot of internal gems and other things too. Are you just gonna start reading every README for every repo and try to figure out where to start? It's gonna be impossible. So the first pro tip is: if you're gonna have a distributed system, make a directory for yourself. The two things we really wanted out of our directory were, first, that it should be very easy for humans on the team to decide where they need to go look for something. And the second thing is that out-of-date documentation makes all of us want to kill other people all the time.
So make sure it's up to date, and have confidence in your directory, by making sure that the production code base is using that same directory to accomplish its work. That way, if the documentation is out of date, things will be broken and you'll know about it. The way we do this is with an internal gem called Atlas. Atlas contains a bunch of different protobuf definitions. Protobuf is just a data exchange format developed by Google. There are lots of other exchange formats we could have used; the fact that we're using protobuf has a few benefits that I'll talk about in a minute, but you could substitute lots of other things here. Some companies use things like JSON plus JSON Schema. There are various other ways to do it.

But one of the core ideas in your directory is: how do you divide up the responsibilities of your system? The way we've decided to divide up our responsibilities is that each application is responsible for a set of resources. Whenever you're talking to people on our team, you'll hear the words "application" and "resource" quite a bit, and in this talk you're going to hear those words a lot. So if you wanted to know what's happening with the user, you can go look in the directory for users, see which application manages users, and you're going to find pretty much everything you want to know about users there. Another nice thing about having it be a private gem is that we get to version it with semantic versioning. So all the clues you want to give other team members about when a change is backwards compatible versus not, you can give using cues that every developer already knows from working with gems. Let's look at an example of this. If I search the Atlas project for "user", I'm going to find this user.proto file. At the top, you can see that this file is in the Amigo directory, so immediately I know that users are being managed by Amigo.
I can also look down here and see the definition of what a user looks like. This tells me the shape of a user as it flows between the applications in my system. If we scroll down to the bottom of that file, you'll see the definition of a service. Protobuf defines RPC by saying there are services, and every service has RPC calls that it implements. This is a low-level detail that on our team you're almost never going to actually think about. It's sort of nice that if we wanted to go outside the convention of each application managing resources, this would give us that flexibility, but it's a layer of abstraction below what we usually talk about. Generally what you'll see is that each resource has a set of RPC calls, but all of those RPC calls are owned by a single service, by a single application within our architecture. And you can see here that an RPC is basically just made up of a name, a request type, and a response type.

The fact that these have types is a little counterintuitive, because you'd think a bunch of people who love Ruby to death would not want to put static types on these messages. But it turns out that having static types is really valuable in certain cases, and if we're all honest with ourselves, we know that static typing has its time and place. So why is it valuable for us? Well, has anyone here ever designed an API and had that API stay the same forever? All right, we're gonna hire you. This is basically impossible, right? You're always gonna get it wrong, and whatever you do to fix it is gonna turn out to be wrong later too.
So what you wanna do is, up front, have some idea of which changes are gonna be backwards compatible, and when you need to make a non-compatible change, you're gonna want some sort of deprecation lifecycle, so that your system can move forward in pieces and not have to change the whole thing every time you change one piece of a message. This is something that protobuf does really well, and we'll walk through a quick example. Let's imagine we've released version 1.0.0 of our Atlas gem, and it contains the definition of a user that just has a guid and a name. We have two applications in our system, and one of them needs to know about a user, so it's gonna send a user search request to Amigo, and Amigo is gonna send back a user: a 1.0.0 user that just has a guid and a name. Of course, the product team is gonna come to you right after you release this and say, guess what? Users need to have emails, because we need to email people, because every company needs to email people and annoy their users. So now we have emails. The Amigo project will have added a database field, and we'll update the gem, but right now Newman still has the 1.0.0 version of this gem, so when it sends over a request, it's gonna get back a 1.1.0 user object. So what's gonna happen? And this is the key question. Whenever you change something in your system, if the developer working on it has to ask, "is this gonna break something?", you're already in a really, really bad condition, and your chances of breaking something are really, really high. So what you wanna do is know up front what the semantics are: which changes are backwards compatible and which ones are not. In this case, adding a field is totally backwards compatible; Newman will just ignore the field that it doesn't have a definition for.
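That unknown-field behavior is easy to see in miniature. Here's a plain-Ruby sketch (not our actual protobuf tooling, just an illustration of the semantics): a 1.0.0-era reader that only knows about the fields in its own schema, happily decoding a message that carries a newer field.

```ruby
# Illustrative only: a "tolerant reader" that mimics protobuf's
# unknown-field semantics with plain Ruby hashes.
class UserReader
  # The only fields this (older, 1.0.0-era) reader knows about.
  KNOWN_FIELDS = [:guid, :name].freeze

  # Decode a wire message, silently dropping any fields that were
  # added to the schema after this reader was built.
  def self.decode(message)
    message.select { |field, _value| KNOWN_FIELDS.include?(field) }
  end
end

# A newer publisher sends a 1.1.0-era user that also carries :email.
wire_message = { guid: "abc-123", name: "Jane", email: "jane@example.com" }

user = UserReader.decode(wire_message)
# The older reader still works; the new field is simply ignored.
# user == { guid: "abc-123", name: "Jane" }
```

The point is that "older reader meets newer message" is a non-event, by design, rather than something each developer has to reason about case by case.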
If you follow Martin Fowler, he calls this idea a "tolerant reader." But what we don't wanna do is make the developer working on this guess whether Newman, and every other service in the system, are all tolerant readers. You don't want it to be a loose convention that you tend to sometimes follow on the team; you want it to be something that's pretty well understood and followed very closely. In this case, protobuf actually defines a set of semantics for which migrations are backwards compatible and which ones are not. Up front that felt a little bit constraining to our team, but as time has gone on, it has proven over and over and over to be really valuable: every engineer on the team can know up front whether or not a given change is gonna cause problems for other people consuming that message.

So we make that change, and then wouldn't you know it, the product team decides that having a single name field is not awesome enough. We need a first name and a last name, separately. So you can imagine we release a version 2.0.0 of Atlas. We no longer have a name field; now we have a first name field and presumably a last name field. And when Newman gets back this 2.0.0 user object, it's gonna break, because Newman might have been using that name field to populate an email or something like that, and now it's gone. Rather than failing somewhere random inside of the Newman application, we would like Newman to fail as soon as it tries to read that message. You want it to fail close to the boundaries of your application, not somewhere deep inside the internals. And this failure is gonna happen as soon as Newman tries to read that message. So instead, what you do is release a version 1.2.0: you put a flag on the name field to say this is deprecated, this field is gonna be going away soon, and you add the new fields.
And for a little while, Amigo's actually gonna populate both of those fields whenever it sends out a copy of a user. Now when Newman asks for a user and gets one back, it can still use that name field, but the next time Newman gets a bundle update and runs its specs, it's gonna get this deprecation warning. Having these tools on your team makes it really easy to communicate across the whole team about which things are changing and which things are being deprecated, without having to have some sort of all-hands-on-deck meeting, or sending out an email every Monday listing which fields are being deprecated. You don't wanna try to globalize these decisions; you wanna keep them localized to each team as they make progress and move forward with their application.

So, wrapping up real quick: what do you want out of your directory? The things I want out of a directory on a distributed system: I want to be able to easily find what I care about. I want to have confidence in it, because it's being used programmatically. I want some sort of plan for how we're gonna change things without breaking, and I want to know exactly what the semantics are around that. I want to know when something's gonna break, so that I can take control of that and take whatever action is necessary.

So: we've made no progress on our project. We are very terrible junior engineers, and we probably all know exactly what that feels like already. So let's make some progress. We finished looking in the directory; we know that Amigo is the thing that owns users, and we can go look at that code base and see some of the internals. We also found, when we were searching through the Atlas project, that there were several other models that had a user guid on them. The first one we looked at was accounts, and accounts are owned by the Abacus project.
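The deprecation warning itself can be sketched in plain Ruby. This is not the actual protobuf-generated code, just an illustration of the idea (the class and message are hypothetical): reading a field that has been flagged as deprecated still works, but warns loudly, so it surfaces the next time a consumer's specs run.

```ruby
# Illustrative sketch of deprecated-field warnings. Real protobuf
# codegen produces this behavior for you when a field is marked
# with the deprecated option.
class User
  attr_reader :first_name, :last_name

  def initialize(name: nil, first_name: nil, last_name: nil)
    @name = name
    @first_name = first_name
    @last_name = last_name
  end

  # The old field is still readable during the transition window,
  # but every read emits a warning to stderr.
  def name
    warn "[DEPRECATED] User#name will be removed; use first_name/last_name"
    @name
  end
end

# Amigo populates both old and new fields for a while (version 1.2.0).
user = User.new(name: "Jane Doe", first_name: "Jane", last_name: "Doe")
user.name # still works, but prints the deprecation warning
```

Consumers migrate to `first_name`/`last_name` at their own pace, and the 2.0.0 release that actually removes the field lands only after the warnings have gone quiet.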
So we could do something in the controller around the user service: whenever a user gets deleted, we send an RPC message over to Abacus. But when we dug a little further into the directory, we found out that there are several other applications that each have models that belong to a user. Pretty soon, this user controller inside of Amigo is gonna have to know about every single other Rails application in the whole system. It should be pretty obvious why this is a bad idea and why this is gonna lead to a lot of pain in your system. And Rich Hickey is gonna look down on you if you do this. So in order to avoid Rich Hickey's disapproving glares, we're gonna find another way.

What we wanna do is have some idea of events within our system. The main goal here is that we want to decouple our applications from each other. When a user gets deleted, we wanna make sure other people know about it so they can take the proper action. But Amigo is in charge of managing users; it shouldn't have to know about the fact that accounts need to also be deleted. It doesn't manage accounts. So we're really trying to tease these things apart. One really important caveat here: this is a way to create less coupled code, but only if your mental model of your system lines up along these resources. If you were lining things up along, say, processes rather than resources, then this would not actually give you less coupled code. This is not an automatic less-coupling thing; it works because we intentionally set up our system so that it's modeled in terms of applications and resources. The other attribute we wanted of our event system is that it should feel very much like normal Rails code. It shouldn't need some crazy sort of callback; it shouldn't look like Node.js or some other programming language.
It should look and feel like part of your Rails application. So the way we're gonna do this: we're gonna get rid of all the complecting, and now Rich Hickey's happy. Instead, we're gonna stick in RabbitMQ. You could stick in plenty of other kinds of messaging systems; the fact that it's Rabbit doesn't really matter. What we're gonna do is ignore all the other systems for a minute and ask: what should Amigo really be responsible for? When a user gets deleted, it needs to know the fact that other people care about this, but it shouldn't have to know the names and addresses and social security numbers of all those other applications. It should only need to know that someone cares. So it's just gonna publish a copy of this user object, and again, it's gonna use our protobuf definition. It's gonna broadcast out this message just saying, hey, a user got deleted, and here's what that user looked like.

One other thing that's slightly important to note here: we're trying to use RESTful-ish sorts of names. You can see that it's namespaced: application, resource name, and event name. But instead of slashes, like we're used to in web programming, we use dots, just because that's a Rabbit thing; it uses dots for separators instead of slashes. Also, we always publish these messages in the past tense, which is a little different from making a synchronous call, but we can keep most of the same conventions around these things. The kinds of events we'll publish are created, updated, and deleted; most of the time, that's all you really need to publish. So now that Amigo has fulfilled that one responsibility it has, these other three systems can decide that they want to receive copies of those messages.
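A minimal sketch of that naming convention (the helper and its name are mine, not from the talk): a routing key is just application, resource, and past-tense event, joined with dots the way RabbitMQ expects.

```ruby
# Hypothetical helper illustrating the "app.resource.event" routing
# key convention described above. Dots are RabbitMQ's separator,
# where a web URL would use slashes.
module RoutingKey
  # Events are always published in the past tense.
  EVENTS = %w[created updated deleted].freeze

  def self.for(application, resource, event)
    raise ArgumentError, "unknown event: #{event}" unless EVENTS.include?(event)

    [application, resource, event].join(".")
  end
end

RoutingKey.for("amigo", "user", "deleted") # => "amigo.user.deleted"
```

Keeping the key construction in one place means every publisher and subscriber agrees on the grammar, which is what makes "go look up who cares about `amigo.user.deleted`" a tractable question.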
One really important thing here: probably not every one of your applications wants a copy of every single message that gets published. That's not gonna scale very well, and you're gonna have to ignore a lot of messages. So you probably want some sort of opt-in system as the number of your applications grows. In this case, RabbitMQ is gonna manage who cares about what. We'll see this in a second, but you can create the list of who cares about what in an idempotent way, so that every time your Rails app boots, it recreates that same context. That way, you don't have to reconfigure Rabbit every time you deploy code. So again, it's opt-in: if Firefly doesn't care about users being deleted, then it's not gonna get a copy of those messages, and that's good.

The other thing it does for us: if we were just making these RPC calls ourselves, you can imagine that in production there's probably not just one node running Abacus. For redundancy reasons and load balancing reasons, we're probably gonna have many nodes of this, and we don't really need every node to receive a copy; we just need one of the Abacus nodes to receive a specific message. RabbitMQ as a message broker fulfills that responsibility for us too. So it gives us this sort of efficiency: each application is gonna receive one copy of each message that it has asked for, and nothing more.

All right, the other thing that we don't wanna do is re-implement connecting to RabbitMQ 18 different times. So of course we're gonna abstract this. The abstraction that we found and have used over time we call ActionSubscriber. It is up on GitHub, and if you're interested in this kind of stuff, please come find me and talk to me at this conference about it.
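That boot-time, idempotent setup can be sketched with a toy broker (this is not real AMQP code; with a real client like Bunny, declaring a queue or binding that already exists is likewise a no-op, which is exactly what makes re-running the declarations on every boot safe):

```ruby
require "set"

# Toy stand-in for a RabbitMQ broker: bindings live in a Set, so
# declaring the same binding twice leaves the broker unchanged,
# mirroring AMQP's idempotent declares.
class FakeBroker
  def initialize
    @bindings = Set.new
  end

  # Bind a queue to a routing key; re-declaring is a no-op.
  def bind(queue, routing_key)
    @bindings << [queue, routing_key]
  end

  def bindings
    @bindings.to_a
  end
end

broker = FakeBroker.new

# The declarations each app runs on boot: opt in to exactly the
# messages it cares about (queue and key names are illustrative).
boot = lambda do
  broker.bind("abacus.user.deleted", "amigo.user.deleted")
  broker.bind("newman.user.deleted", "amigo.user.deleted")
end

boot.call # first deploy
boot.call # redeploy: same declarations, same end state
broker.bindings.size # => 2, not 4
```

Firefly simply never declares a binding for `amigo.user.deleted`, so the broker never sends it those messages.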
There is a pull request up right now that we're kinda debating, around being able to do message retries and some sort of exponential backoff on these messages, and doing some trade-off evaluation in terms of how we're gonna implement that. So if this is interesting stuff to you, like I said, come find me, jump up on GitHub, and give us some more eyeballs on this problem. One nice thing about the gem is that under the hood we're using the Bunny and MarchHare gems. So if you've ever used RabbitMQ before and found out that you now have to require EventMachine and have an event loop in the middle of your Rails application: that's not the case here. We've managed to avoid that, and we wrap around the underlying Java RabbitMQ driver, so things like being able to connect to a whole cluster of Rabbits, and letting them each go down and come up as necessary, just kinda work out of the box.

So, an example of what this looks like. Remember, one of our goals was that dealing with this kind of code should feel very familiar, and if you kinda squint at this code, it pretty much just looks like a controller in Rails. So even though these messages are inherently asynchronous within our system, we can treat them like a synchronous call. You end up defining a subscriber, you inherit from ActionSubscriber::Base, and just by defining a deleted method, that's enough of a hint to the system: if you are following the conventions, it will figure out what route you care about and what type of messages you're expecting to receive, and it'll wire up all the details for you the next time you boot the process.
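Here's a rough, plain-Ruby sketch of that convention-over-configuration idea (class and method names are illustrative, and the real ActionSubscriber gem does considerably more): the routing key is derived entirely from the subscriber's class name and the handler methods you define.

```ruby
# Illustrative sketch: derive RabbitMQ routing keys from a
# subscriber class name plus its handler method names, the way a
# convention-over-configuration subscriber layer might at boot.
class SubscriberBase
  # For UserSubscriber#deleted with publisher "amigo", the derived
  # route is "amigo.user.deleted".
  def self.routes(publisher)
    resource = name.sub(/Subscriber\z/, "").downcase
    handler_methods = public_instance_methods(false)
    handler_methods.map { |event| "#{publisher}.#{resource}.#{event}" }
  end
end

class UserSubscriber < SubscriberBase
  # Handle "a user got deleted" events for this application.
  def deleted
    # clean up this application's data for the deleted user...
  end
end

UserSubscriber.routes("amigo") # => ["amigo.user.deleted"]
```

Nothing here is registered explicitly; defining the class and the method is the configuration, which is what makes the subscriber read like just another Rails controller.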
And here you can see that we have this payload object. You can think of it as the same thing as params inside of a Rails controller, except here the payload is whatever message got published to RabbitMQ. ActionSubscriber also supports middlewares, so we jam a middleware in there that decodes the protobuf object, so that by the time it hits your subscriber, it's actually decoded; it's just another Ruby object, and you can deal with it in whatever way you like. So that's pretty convenient.

There's an opposite side to this coin: you of course also need to be able to publish events. This is the thing Amigo is gonna be responsible for. That's not live yet; it's still in an internal gem, and I'm in the process of extracting it to a public gem. If you have opinions about what the API for that should look like, please, again, find me and let's talk about it. But basically, the way it works right now: you have a module, you include it into an Active Record object, because most of the time the only things we care about publishing are lifecycle events for a resource that we manage. Just by including that, it's gonna create the after_commit hooks for you, and it'll publish things as necessary.

So, wrapping up events: what did we want out of this system? We started off by complecting all of the things and incurring the wrath of Rich Hickey; now we want to get away from the wrath of Rich Hickey. The things we cared about were low coupling between our systems. We wanted this to feel pretty similar to normal Rails code. We wanted to be able to opt into messages, not have to ignore messages we don't care about. And we also wanted to make sure that we received just one message per logical recipient; in other words, don't send 20 copies just because there are 20 copies of Abacus running in production. All right, now we're junior engineers who are sort of making progress, right?
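The publishing side the speaker describes might look roughly like this sketch. The module name and publish mechanics are hypothetical (the real gem wasn't public at the time of the talk), and in an actual Rails model the hook would be ActiveRecord's after_commit; here the lifecycle is faked with an explicit destroy method so the example is self-contained.

```ruby
# Hypothetical sketch of a "publishable" mixin: including it into a
# model wires up lifecycle event publishing for that resource. A
# real version would register after_commit callbacks; this toy
# version calls publish_event by hand to stay runnable.
PUBLISHED = [] # stand-in for a RabbitMQ channel

module Publishable
  # Publish an "app.resource.event" message, always past tense.
  def publish_event(event)
    resource = self.class.name.downcase
    PUBLISHED << "amigo.#{resource}.#{event}"
  end
end

class User
  include Publishable

  def destroy
    # ...delete the row inside a database transaction...
    publish_event("deleted") # would run in an after_commit hook
  end
end

User.new.destroy
PUBLISHED.last # => "amigo.user.deleted"
```

The model never names its subscribers; it just announces what happened, and the broker handles who cares.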
Kind of. We have an idea now of what we need to write code for. If we look at this problem now, we can see that this one project will actually end up being, in this case, four different tasks. First we're gonna make sure Amigo is publishing the events we care about, and then for each of the other three applications, we're gonna do some small deploy where we start subscribing to that message and then take whatever action is necessary for that application. So Abacus is gonna go find all the accounts for that user and delete them.

And maybe that strikes you as a little bit odd: we had this one project, but now I have to go do four things. It turns out that for us, in our experience, this means a few things. It allows our code bases to be a little more cohesive. Again, this cohesion happens because, since we are modeling our system in terms of resources, events make it very easy for us to keep all the code about a given resource in one place, which is the cohesion we care about. If you're modeling your system in some other way, that may not be the case; this may not be totally transferable, or you may just have to rethink the way you name your messages and your events. Another really nice benefit: if you like to do code review as a cross-training methodology, this means that each of these four pull requests is gonna be a lot smaller than if they were all one pull request. And smaller pull requests are very important if you actually want people to review the code, because if your pull request is a thousand lines of diff, no one's gonna read it; everyone's just gonna skim and say, "ship it." So if you don't want people to just say "ship it," you need to keep your PRs small, and this really helps keep them small. But it does incur an extra overhead: we're gonna have to deploy four things instead of one.
And that raises the question: how much does deployment cost us? The major costs of deployment for us are, first, that it's gonna take someone's time. Hopefully we've automated this thing quite a bit, but usually you end up having at least some manual step to decide, yes, we really want to deploy this code. So there's gonna be someone's time involved here. Secondly, that someone's time is interrupting them from whatever else they were doing. We have a pretty homogeneous team: the people who deploy code are also the people writing code, and we like it that way. We like the gap between production and development to be pretty minimal. But that means every developer is gonna have to stop and get interrupted during their day in order to do a deploy. And finally, you have to consider system downtime, and for us it turned out that system downtime mattered a lot.

If you look at this little timeline, there are no brownie points awarded for guessing when we implemented zero-downtime deploys. Just think about this for a second: during these three years, our team size stayed exactly the same, plus or minus one. The number of Rails applications we had in production stayed the same, plus or minus one. So that means we somehow increased the number of our deploys by more than six times without hardly changing anything in the internal system. And we certainly weren't working more days per week. So what impact does that have? Well, the team overall probably hasn't started writing more lines of code, so generally that means each of these deploys is smaller. And one of the things that does is mitigate the risk, and mitigate the fear, that goes along with doing a deploy. I find myself all the time in the situation where I'm deploying someone else's code, and that's always a little bit unnerving. And even more often, I find myself deploying my own code.
And that's really unnerving, because I know who wrote it. So the smaller that change is, and the fact that it was a small pull request and probably got reviewed more thoroughly, means a lot to me on my team. It's something I really value. The way we do this is with another internal gem we call Trebuchet. It's not very interesting; it's just a bunch of extensions to Capistrano that make it so that when we say go deploy this thing, if there are 21 copies of a service, it'll pick, say, the first five, make sure they're not receiving requests anymore, stick up a firewall rule, turn them all off, upgrade the software, bring them all back up, and then go do the next chunk. It does these rolling deploys in chunks to give us zero downtime. Overall, really not that interesting.

But if we look back on this, what are the benefits our team gained from having this convention of zero-downtime deploys? We have these code bases that are hopefully a little more cohesive; at least they could be, if we're good at writing code. We get easier access to automated deploys, because each application is doing a little bit less, so you have fewer moving things to worry about in order to accomplish a zero-downtime deploy. We minimize the cost of doing deploys, and the risk of doing deploys, by automating them as much as possible and keeping them small. And finally, we get this code review, which, from an organizational standpoint, has actually been the biggest benefit I've personally seen out of this. One of the big reasons I wanted to go work at MX was that there are really smart people there I can learn from, and the fact that I hear regular feedback from people who have actually done real reading of my code is hugely valuable to me.
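The rolling-deploy loop the speaker describes can be sketched in a few lines of Ruby (the names and steps are illustrative; the real Trebuchet gem drives Capistrano and actual firewall rules):

```ruby
# Illustrative rolling deploy: take nodes out of rotation in small
# chunks so the remaining nodes keep serving traffic the whole time
# (that's the zero downtime).
CHUNK_SIZE = 5

def rolling_deploy(nodes, log: [])
  nodes.each_slice(CHUNK_SIZE) do |chunk|
    chunk.each { |node| log << "drain #{node}" }   # firewall rule, stop traffic
    chunk.each { |node| log << "upgrade #{node}" } # stop, upgrade software, restart
    chunk.each { |node| log << "enable #{node}" }  # back into rotation
  end
  log
end

nodes = (1..21).map { |i| "abacus-#{i}" }
rolling_deploy(nodes) # 21 nodes deployed in chunks of 5, 5, 5, 5, and 1
```

At no point are more than five nodes out of rotation, so a 21-node service keeps at least 16 nodes serving requests throughout the deploy.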
So, wrapping up, the three things I've learned about distributed systems are: make yourself a directory; publish events (it would be really bad if I forgot that while it's on the slide); and finally, deploy small changes. And a final note here: we saw a lot of good talks earlier today about learning and sharing. All of the things I talked about are things you can do in your code or in your tech. But the truth is that your team matters a lot more than your tech. The bigger changes you're probably gonna have to consider, if you're moving from a monolith to some sort of distributed application, are changes to the way your team works together. And that's a different talk, and we're not gonna go over it. But the main things are actually the learning things we heard about earlier: mentorship, and some sort of ongoing education, because a lot of us don't yet have a lot of experience working on distributed systems, and so we're gonna wanna be teaching these things on an ongoing basis on our teams. Also, Rich Hickey is apparently really funny. So, that's the end. Thank you very much.