David Copeland. David is returning. This is his second year here. Last year he came and gave us a really great talk on doing test-driven development of command line applications in Ruby. He wrote a book on that published with PragProg, right? Yep. And now he's at LivingSocial and he's gonna talk to us about service-oriented architecture wackiness. Thank you, David. That's right. Thanks. Thanks. Right. So here's the title of my talk: Services, Scale, Backgrounding and What Is Going On. So this is really gonna be a story about reasonable developers making reasonable decisions, doing reasonable things as they move away from a monolithic Rails app and allow their architecture to scale and be more manageable and handle more things and handle a larger team and all that involves. And despite everyone in this story making reasonable decisions, things will go wrong and strange things will happen, and we'll talk about how to deal with that. This is based on my personal experience at LivingSocial. I work on an application that we call payments, and every single time someone buys something on LivingSocial it goes through this application. So you can imagine it needs to work, be fault tolerant, be running, be monitorable, all that. And it was all extracted from a monolithic Rails app. So there's all this code that was expecting to run in-process, synchronously, and now all of a sudden it's been yanked away from that and has to run asynchronously across several different processes and several different services, and so a lot of weird things happen. So let's begin our story. Our hypothetical business, you know, we're trying to get users to sign up and buy stuff from us and all that. So the first thing we've got to have is a controller that lets people sign up. So in our hypothetical business, we don't want people to just sign up and use the site.
We want to send them an email, and they need to click on a link in that email to validate their email address, and then their account is activated and they can use the site. So you can see we have a pretty vanilla Rails controller with just a little extra UserMailer call sitting in there, but that's, you know, pretty straightforward. And this is awesome. It was easy to write, easy to test. We got it deployed. We got people signing up. This is great. This is why we're using Rails. It's great. But we notice after a while there's a little bit of a problem. So let's walk through how this controller works, and maybe some of you have seen this example before; I stole it from someone else. You submit the user info, it gets saved to the database, and then really that should be it. You should go on your way to the next screen or whatever it is you need to do, but this mailer is happening. So we've got to send this email, and only when the email has been sent is control returned to you. So this isn't a great user experience, especially if we were under high load. But it also makes it hard for us to manage our resources, right? So suppose TechCrunch posts about our awesome new service and a bazillion people go to sign up. Well, the reality is, we don't really need those emails to go out synchronously with the people signing up. The emails can be sent, like, pretty soon, but later. So we want to be able to allocate our resources to maximize the user experience. The way we've implemented this controller, we can't do that. So let's fix that. So we use background processing. We'll take the stuff that doesn't need to happen in sync with the user, namely sending the email, and we want to run that in some offline task. So we've decided we're going to use Resque, and so we just change our mailer to use this line of code.
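That enqueue line isn't shown in the transcript, but its shape is easy to sketch. Here's a minimal stand-in for what Resque.enqueue does under the hood, with an in-memory Array playing the part of Redis; FakeResque is an illustrative name, not the speaker's actual code:

```ruby
require "json"

# Minimal stand-in for Resque's enqueue: serialize the job class name and
# args to JSON and push onto a queue. Real Resque pushes onto a Redis list
# keyed by the job's queue name; here an Array plays the part of Redis.
module FakeResque
  QUEUE = []

  def self.enqueue(job_class, *args)
    QUEUE << JSON.generate("class" => job_class.name, "args" => args)
  end
end

class NewPersonEvent; end  # the job class; its perform method comes later

# This is the fast call that replaces the synchronous mailer:
FakeResque.enqueue(NewPersonEvent, 42)
```

The write is just a small JSON blob into a fast store, which is why it beats waiting on an SMTP gateway.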
If you don't know about Resque, briefly, what it does is: this is creating a job that will be serviced by the NewPersonEvent class, and it will be given the person's ID when it's run. This will jam that into JSON and shove it into Redis, which is a super fast key-value store. So this call is going to be super fast, way faster than that mailer call. On the other end, we've got another process that will be running to process all of these things that we've put into Redis, and so the Resque poller will find this class and it will call perform, and you can see what it does. It finds the person in the database and sends an email. This is good, right? So now we can allocate things for the users to make the user experience as good as possible, and we can manage all of our background processes separately, even on a different machine if we want. This is good. We've got configurability, flexibility in our architecture, ability to scale under certain conditions. So this is a good thing. So, we've configured our application so if anything goes wrong, we get an email about what went wrong. A few months later, we get such an email: this line of code here inside our controller generated a timeout, meaning when it tried to talk to Redis to put this job in there, it timed out and blew up. So this doesn't happen often. I mean, I see this happen more than I thought it would, but it does happen. You can never prevent it from happening. So let's think: given that this happened, what is the state of our system right now? So a user account has been created in the database, but that email, right, they need to click on that email so that they can validate and use their account. Well, they're never going to get that email, and so we've got this account sitting around that just can't be used. And what's worse, I didn't tell you, but we have a unique constraint on the email field in the database.
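Here's a sketch of the worker-side job class being described, with Person and UserMailer as plain-Ruby stand-ins (the real things would be ActiveRecord and ActionMailer); the @queue variable and self.perform signature are the shape real Resque job classes use:

```ruby
# Plain-Ruby stand-ins so the sketch runs without Rails:
class Person
  RECORDS = {}

  def self.find(id)
    RECORDS.fetch(id)
  end
end

class UserMailer
  def self.sent
    @sent ||= []
  end

  def self.welcome_email(person)
    sent << person  # stands in for building and delivering the mail
  end
end

# The job class the Resque poller would pick up:
class NewPersonEvent
  @queue = :new_person  # the queue this job is serviced from

  def self.perform(person_id)
    person = Person.find(person_id)  # look up the row by the enqueued ID
    UserMailer.welcome_email(person)
  end
end

Person::RECORDS[1] = "alice"
NewPersonEvent.perform(1)  # what the poller calls with the stored args
```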
So if the user just, like, refreshed, or filled it out again, or came back a day later and filled it out again, it's not going to work, because their email address has been taken because of our unique constraint. So these were all good things that we put in place, and they've all kind of come together to create a weird situation. So how do we fix this? That is how we have to fix this: the production console. Production console is great in many ways, right, because we can deploy things to production without having to think of every crazy edge case that could possibly happen, because we know if they do happen, we can go in, and we've got the full power of our application to fix it manually in those weird cases. But we're putting the business at serious risk inside a production console, because you can do all kinds of damage. You can really screw things up. So if we're going in here for the same thing too many times, it might be worth trying to find a code solution to the problem. So what's the simplest thing we can do to prevent this weird state that requires manual intervention from happening? So what we want, ideally, is for the person to be created and the email to get sent. That's the perfect solution. But if that can't happen, we would like neither of those things to happen, because at the very least, if neither of these things happened, then the system is in the state it was in before. It's not broken. We don't have to go into a console to fix it. This would be better than nothing. So we will use database transactions to do this, right, because when we make a person, that's done in the database, and the only other thing we're doing is sending this email. So we go into our controller. So Person is an ActiveRecord class, and as you know, they all have a method called transaction. It takes a block. And if anything goes wrong in that block of code, all database operations are undone, which is exactly what we want, right? If the person create fails, okay, that failed, we're fine.
If the person create succeeds but then we have any failure in enqueuing our job to Resque, then the entire person will not be created. Their email address will not be saved. Everything is okay. I mean, the user still has to deal with a 500 error. They still have to resubmit, but at least they can fix it. We don't have to go into a console and take manual steps to fix this problem. If we continue getting these timeouts, obviously, we have a new problem, and this solution might not fix that, but we're at least making things better. So now that we've done this, we don't see these timeouts anymore. Or rather, we don't see this bad data situation cropping up. And then we get an exception in our job processing class. So in Resque, when something goes wrong, the job itself goes into a special queue called the failed queue, where you can inspect it and you can see the exceptions and the stack traces that put that job there. And you can replay them. So the idea is, for a transient error like a timeout or a network failure, you can just replay it and everything will be fine. This is not a transient error, so it seems. So we go into the database, and we see the person with this ID is in the database. So whenever this actually executed, the person was not in the database. But we got that ID from the database. So how could that possibly have happened? No one's inserting jobs on our behalf, no one's deleting users; we replay this and it works fine. So what went on? Let's play out the code and how it might have occurred. What could have happened to cause this? So we got our ID right from the database, and then we created our event. So the second we created that event, we have this other process running, this poller, and it sees the job and it's like, right on, this is what I'm here to do, I'm going to process that job. So it does that, but the transaction from over here still has not committed yet.
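Here's a pure-Ruby simulation of the transaction fix, assuming roughly this shape: FakeDB stands in for ActiveRecord's transaction-with-rollback semantics, and the lambda stands in for the Resque enqueue that can time out. All the names are invented for illustration:

```ruby
require "timeout"

# Stand-in for ActiveRecord: a transaction that rolls back on any error.
class FakeDB
  def self.rows
    @rows ||= []
  end

  def self.transaction
    snapshot = rows.dup
    yield
  rescue StandardError => e
    @rows = snapshot  # roll back everything done inside the block
    raise e
  end
end

# The controller action's shape: create + enqueue inside one transaction.
def sign_up(email, enqueue)
  FakeDB.transaction do
    FakeDB.rows << email  # stands in for Person.create!
    enqueue.call          # stands in for Resque.enqueue(NewPersonEvent, id)
  end
end

sign_up("a@example.com", -> {})  # happy path: row committed
begin
  sign_up("b@example.com", -> { raise Timeout::Error, "redis timed out" })
rescue Timeout::Error
  # the 500 the user sees; at least the database stayed consistent
end
```

After the failed signup, the half-created account never hit the database, so the user can simply try again.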
So while we're in that transaction, the ability to roll it back and undo all those changes is made possible because no one can see what's going on in the database outside of us. But we've sent the ID outside of our transaction to someone else, who then can't see what we've been doing to the database. It's a little race condition. I see this all the time. We could solve this maybe by not using Resque, by picking something else, and who knows what problems that would come with. That sounds like a lot of work. Is there a way we can fix this so that we stop seeing it happen? So we could assume that if we get this sort of error, where the person is not in the database when the event fires, we're currently experiencing this race condition. And if that were the case, we could probably just, like, wait a little bit and try again, and probably everything will be consistent. So we're going to assume we've got a library called ResqueRetryotron7000 that handles this for us. And I had another slide that was a big bunch of while loops and pluses, and it's totally not interesting. So the point is, we're setting up a condition, and a condition that we expect might not be met at the time the job is processed, right, person.nil?. And so if that happens, then we'll retry, and we'll wait a little bit, and we'll limit the number of times so that we don't get into some infinite loop of retrying. And, you know, it's hard to argue that this is amazingly clean, beautiful code, but it does work, and it does solve the problem, and it wasn't that hard to create. Because in reality, like, this race condition is probably going to complete in one try, certainly by five. Anything that gets through either means we need to retry more, or we've got a new problem that we need to solve that this isn't solving. But we didn't have to rip out Resque. We didn't have to do a grand re-architecture of anything.
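The retry library in the talk is fictional, but the underlying idea can be sketched in a few lines: re-check a precondition with a short sleep in between, capped at a fixed number of attempts. The method name and numbers here are made up for illustration:

```ruby
# Retry a block whose precondition may not be met yet (e.g. the row hasn't
# committed): nil means "not ready", so wait briefly and re-check, up to a
# bounded number of attempts so we never loop forever.
def with_retries(max_attempts: 5, delay: 0.001)
  attempts = 0
  loop do
    attempts += 1
    result = yield
    return result unless result.nil?
    raise "gave up after #{max_attempts} attempts" if attempts >= max_attempts
    sleep delay  # give the open transaction time to commit
  end
end

# Simulate a row that only becomes visible on the third read:
reads = [nil, nil, "person-42"]
person = with_retries { reads.shift }
```

If the race is real, this almost always resolves on the first or second attempt; anything that exhausts the limit is a genuinely different problem.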
We just had to add a little bit of code to help make everything kind of consistent before we executed it. So again, I would say this is a reasonable way to solve this problem. So, continuing the story of our company: we've gotten so successful, we're making all this money, we're going to buy our competitor, right? We're just tired of dealing with them. We're just going to buy them up and get all their users, and their users are going to be our users. So we need to get their users into our database. Now, our competitor did not have the requirement of validating email addresses. So what we want is to create all these records in the database and have welcome emails sent to all of them, so they can all click on them and validate that they are who they say they are and all that stuff. Right? Basically, we want the same business logic to happen as if they signed up on the website. But we're being given the data already, so the users don't actually have to sign up. So if we were to state this another way, we could say that after we create a person, we want to run this business logic, which is currently just sending an email. So, I would say, a reasonable way to do that would be to use the after_create ActiveRecord hook. This basically says: every time a valid person is created, inside a transaction, we will run this method that we specified here, send_new_person_event. And that method will enqueue our job. So now our little bulk uploading thing, all it has to do is create Person records, and this happens automatically. And there's another benefit: our controller loses all this cruft about transactions and enqueuing, and it goes back to being super clean and simple, like, even better than it was before. So this really feels like it was an improvement, right? I mean, how often does a new requirement actually improve the code base? This kind of feels pretty good. Now, if you've ever used ActiveRecord callbacks, you might not like them.
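A plain-Ruby sketch of what the after_create approach buys: creation itself fires the event, so the signup controller and the bulk importer share one code path. This mimics the callback by hand rather than using real ActiveRecord, and all names are illustrative:

```ruby
class Person
  ENQUEUED = []  # stands in for the Resque queue
  attr_reader :id

  # Stands in for ActiveRecord create!: after a successful create,
  # the after_create callback (send_new_person_event) fires.
  def self.create!(email)
    person = new(email)
    person.send(:send_new_person_event)
    person
  end

  def initialize(email)
    @id = email  # good enough for a sketch; real code uses the DB id
  end

  private

  def send_new_person_event
    ENQUEUED << [:new_person_event, id]  # Resque.enqueue stand-in
  end
end

# Controller path and bulk-import path are now identical:
Person.create!("signup@example.com")
Person.create!("imported@example.com")
```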
I will not argue with you, I don't like them either, but it's hard to argue that this wasn't reasonable, right? There are other ways we could have done this, and if we really sat down and hammered it out, maybe we could find one that was definitively better than this. But I would say that this was not wrong. Like, it seems logical for someone wondering where the logic about creating people lives to find it in the after_create callback. That seems logical. I mean, Rails provides it for a reason, for us to use. So we will see how this decision affects us in a moment, but the point is, this was a reasonable way to solve this problem. So a few months later, right, our team is growing. We've got a team that's going to, like, log some stats and set up some sort of monitoring of what's going on, and they want to know every time we create a new person, so it shows up in this nice little graph. There is a juicy method that they can just throw that right in there. See, one method, one line, pretty easy. They got their job done in, like, two minutes. A few weeks later, we decide we need to have a cache of all the people's names that have signed up, and what better place to put that than in send_new_person_event, which is called from after_create. So now another line of code was written, not too bad. Of course, we've got this skunkworks project. We don't know what they're doing. It's kind of crazy, with, like, Markov chains and weird Bayesian networks, and they've got to hook into this process somehow, and we don't totally know what they're doing, but they have to add their lines of code in here as well. So now, right, we have a method that's got a lot of weird stuff in it, and it doesn't seem to have anything to do with anything. And, you know, I flashed through these in a few seconds, but in reality this was, like, potentially months; months of time has elapsed, and thousands of lines of code have been written, and only three of them ended up in our poor abused method.
And so you might say, the person that put this here, I mean, come on, they should have cleaned this up, right? Well, you know, I would say it would be unreasonable to have expected someone to, like, re-architect this thing just because they put one too many methods into it, right? The business didn't necessarily need them to spend a week fixing this dirty method when they could spend five minutes adding what they needed to add and move on with what they needed to do. I mean, that's the definition of technical debt, right? We as a business were willing to accept this crappy-looking method because we got business value delivered more quickly. So someday maybe we'll fix it. In the meantime, in other parts of the organization, right, this is what happens when there are lots of people; there are lots of things going on at the same time. You know, parallelism in the company is just as weird as in the computer. So we've decided that we don't like mailers, right? All of our Rails apps are using these mailers, and in half of them, like, the mail is not going through, it's getting caught in spam filters; in the other half, it's working fine. We need to centralize this and have it in a single, controllable place. So mailers are gone, new apps will not use them, and we're going to migrate existing legacy apps to use our new awesome mail service. And the team that is designing this has gone to great lengths to make it just a drop-in replacement for the mailers that we're all using. So they've made it so that we don't have to add any extra crap to our app, just a little bit of configuration, and replace one simple call with one other simple call. This is going to be an easy migration, so we think. So it gets around to us, and here's our new person event, and gone is the UserMailer call; now we're calling the mailer service, right? And it looks pretty much the same.
You know, things are in a little bit different order, but it's pretty much the same, it's pretty easy to understand, and it was pretty easy to get in there. Now, if you've ever done this, you know that it's not as simple as just flashing a couple of slides by; it actually takes a while, especially if you're the first app that has to use some new service. You might have to go back and forth refining the API, making sure that it's going to work. So while that was going on, right, there's another thread of things, where someone has decided that that method we saw a few slides ago, they don't like it. This is terrible. It's called send_new_person_event, and there's a bunch of crap in it that has nothing to do with sending a new person event. Someone has decided to take time out of their busy schedule and fix this, right? Here is their solution: remove these lines. So now this method does exactly what it says it does. You know, there were so many things wrong with this: all of this is running in-process, it makes our tests hard, we have to mock all this crap out. Like, they're really doing us a favor, removing it and putting it into the event itself. So this seems a lot more logical. All of this stuff is now running in an offline process, where we don't care if it takes a little bit too long, and we don't have to mock this crap out; we're just using Person objects in our tests. And it seems more central, right? Before, we had the business logic of a new person kind of split up. So now it's more in one place, and it does seem logical that stuff about making a new person would be in the new person event. So this feels like a really good change, and we're excited that this person took time to do it for us and fix and clean up our application. So: we've integrated the mailer service, we've taken care of this technical debt, we push up to production, and we see an event has failed. Events fail, not all the time, but on occasion, right?
Because there are certain transient problems that happen. So what happened in this one? You know, normally we replay these, because they just get fixed, but this had a little bit of a weirder failure mode. So we sent mail; the mailer service call worked like a champ. Pinging the stat server: our dot on our graph is going to show up, no problem. But then, when we went to warm up this cache, some weird thing happened in the network and it just blew up. So the whole thing just died. So now we have an event that's half played: part of it was played and part of it hasn't been. So we can certainly fix this in a production console, but we don't want to be in this situation at all. I mean, this is bad. So we could take every line of code and put that in an event that calls the event for the next line of code, and so on. Like, we could do that, but that's why we're not using Node. That's terrible. We want our code in the order that we wanted it to run. So if we were to replay it, right, obviously we know what's going to happen. The mail is going to get sent twice, and then the user gets two emails, and they have two different unique values for some token, and one works and one doesn't. And who knows which one they're going to click on, and they're going to call customer support, and then we're wasting time on that because we couldn't solve this problem. So how could we? We could certainly code around it. It's going to be a little more complex to code around it inside our app. And I would say that the real problem is that, the way we designed the mail service, it's too dumb, right? We spent all this time making it a drop-in replacement that is really easy to integrate with, but it lacks some useful features that would allow us to not have this problem of having a half-played sort of thing. So what things could the mail service have included that would make this more helpful?
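The half-played event can be simulated in a few lines: three side effects in order, with the third one blowing up. All the collaborators here are stand-ins for the mailer service, stats server, and cache from the story:

```ruby
EFFECTS = []  # record of side effects, so we can see what half-played means

def perform_new_person_event(person_id, cache_fails: false)
  EFFECTS << [:mail_sent, person_id]      # mailer service call: worked
  EFFECTS << [:stat_pinged, person_id]    # stats server ping: worked
  raise "network blew up" if cache_fails  # cache warm: died
  EFFECTS << [:cache_warmed, person_id]
end

begin
  perform_new_person_event(42, cache_fails: true)
rescue RuntimeError
  # the job lands in the failed queue, half played
end
# The mail and the stat ping already happened; replaying the job from the
# failed queue would re-send the mail -- the duplicate-email problem.
```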
Well, the calls could be idempotent, and that means that you can make the call a million times, but it's only going to have an effect once. That would be darn helpful: every time we called mail, it would only ever send one mail to the user. Like, if that was what the mail service did, that would be helpful. Now, how's the mail service going to know which emails to send and which ones not to? We'll talk about that in a second. Another thing that would be helpful would be if the mail service exposed some of its internal state to us. So I could ask it, did you send this email? If you have, cool, I will not ask you to do it again. But if you haven't, then I'm going to do it, right? Either of these two things would have been pretty helpful. So, for making it idempotent: the way that we would implement that, without the mail service having to know the business rule of only one welcome email being sent per person, is we could have the mail service require a request ID. And it would promise that it would never send mail more than once per request ID. So as long as we, the client, generated a valid request ID that was unique for the thing we want to be idempotent, everything would work. And that would be pretty easy, right? The name of the email and the person ID, that would probably have worked for us. And like I said before, right, you can call it a zillion times and we would get the same result back every time. So we wouldn't even know that we've called it a zillion times. That would have been really helpful. That would have solved this problem and allowed this event to get replayed. If the mail service instead couldn't do that, but wanted to expose its internal state to us, right, what emails have you sent, then the client can check before calling. You know, that would work too. Not ideal, but it would definitely work. So this change kind of sucks, though, right?
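Here's a sketch of the request-ID idea, with an in-memory mail service invented for illustration (a real one would persist the seen IDs durably and atomically): the client derives the ID from the email name and person ID, so replays become no-ops:

```ruby
# Idempotent mail service: at most one send per client-supplied request ID.
class MailService
  def initialize
    @seen = {}  # request IDs we've already honored
    @sent = []  # mails actually delivered
  end

  attr_reader :sent

  def deliver(request_id, to:, template:)
    return :already_sent if @seen.key?(request_id)  # replay is a no-op
    @sent << [to, template]
    @seen[request_id] = true
    :sent
  end
end

svc = MailService.new
# A request ID unique to the thing we want to be idempotent; the talk
# suggests email name plus person ID:
rid = "welcome_email:42"
first  = svc.deliver(rid, to: "a@example.com", template: :welcome_email)
second = svc.deliver(rid, to: "a@example.com", template: :welcome_email)
```

With this in place, replaying the half-played event is safe: the mail step collapses to a no-op and only the failed steps do new work.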
The other changes that we saw were pretty small, and they weren't too dirty, and we could do them all within our app. But here we have to redesign the mail service's API, and so, like, that's kind of a problem, and then, you know, you've got a three-month-old service with a V2 already. I mean, it would have been nice if we had thought about this maybe a little bit more before embarking on this part of our journey. But we didn't. So now we're starting to get paranoid. Like, what if the mail fails at the gateway? We said it got sent, but then it didn't actually ever get sent. What if I forget to check the internal state beforehand, or some client doesn't get that right? What if two requests come in at the same time? There's a race condition there, and two mails can be sent. What if two apps generate the same request ID, and so one of them sends the mail and one of them doesn't, but it doesn't know that? Like, this is just crazy. You can go out of your mind trying to think of everything that could happen. And the bad news is that you're never going to get it all fixed, right? You can't make it perfect. You can't prevent every weird thing from happening. What you can do is try to have a better understanding of what's going on, so that you can evaluate how to solve these things when they're happening. Okay, so that's the end of our story, and now we'll talk about some, like, real things that you could do, given that you might now be paranoid about using services and background processing. So the first thing, super helpful, is to have a historical record of what happened, right? So Rails has a logger that doesn't have timestamps. It uses a multi-line format and has ANSI escape codes. So you've got to fix that, and you've got to start using an actually useful logger to log things about your app.
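As a concrete example of "an actually useful logger", here's the stdlib Logger configured for one timestamped line per entry; the exact format shown is just one reasonable choice, not something prescribed by the talk:

```ruby
require "logger"
require "stringio"
require "time"

out = StringIO.new        # stands in for a log file or $stdout
logger = Logger.new(out)

# One line per entry, with a machine-sortable timestamp and the severity:
logger.formatter = proc do |severity, time, _progname, msg|
  "#{time.utc.iso8601} #{severity} #{msg}\n"
end

# Log what the app is doing, not just what Rails is doing:
logger.info("person created id=42 path=signup")
```

Lines like this, greppable and in order, are what let you reconstruct what actually happened instead of reasoning about what probably happened.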
Rails is great at logging, like, what it's doing, and that's good, but you need to know what your app is doing too. Business logic that's happening, paths that were taken that might be interesting. Just think: if something important fails, what information would you love to have had to figure out what went wrong? And then log that. If it's feeling expensive, do it at debug and turn debug off by default, but really, put logs in, and then only take them out if you really need to for performance reasons. Along with that, there are certain tables in your database that are important, like an order or a person or something like that, where you might want to know about mutation of those rows. So you can audit that. You can figure out who changed that. What did they change? When did they change it? Maybe even why they changed it. And these two things together can allow you to piece together exactly what happened at some point in your application's history. You don't have to sit there and try to reason about what order things happened in. That's what we had to do to sort out some of these problems. If we had a log, we could just see exactly what happened and in what order. So that is super helpful, and it's often not the order you think it's going to be. Secondly: bad data. You do not want bad data in your system. ActiveRecord validations are not a way to prevent bad data from getting into your database. ActiveRecord validations are great at getting a user to give you valid data in a web form. But they can be easily circumvented. You can just go straight to the database and not even use them. Things like uniqueness validations don't even really work. You can't rely on them alone. You should have them, because they're good for usability, but put constraints on the database. If a field can't be null, make it NOT NULL. If one record should always have a related child record, make a foreign key constraint. If something should be a number, make it a number.
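The constraints being recommended look something like this as plain DDL; the table and column names are assumptions, and the syntax shown is MySQL-flavored:

```sql
-- The database, not ActiveRecord, enforces these:
ALTER TABLE people MODIFY email VARCHAR(255) NOT NULL;     -- can't be null
CREATE UNIQUE INDEX index_people_on_email ON people (email); -- real uniqueness
ALTER TABLE orders
  ADD CONSTRAINT fk_orders_person
  FOREIGN KEY (person_id) REFERENCES people (id);          -- related record must exist
```

A uniqueness validation alone has a check-then-insert race; the unique index makes the database the last line of defense no matter which code path does the insert.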
This way, no matter what your code is doing, you know the database is going to keep things in order and it's going to keep the data good. Now, database constraints are very simplistic; especially if you're using MySQL, you don't have very sophisticated checks that you can run. So, in conjunction with database constraints: sanity checks. Write some scripts that go and examine the state of your system and see if things that should never happen have happened. And then email someone if that is the case. Run this in cron all the time. And then, when it comes back with something, don't code around it; fix it. It might take longer to fix it than to code around it, but it will pay off easily down the road. And then your code has as few, like, stupid unnecessary complexities in it as possible. Now, along the lines of fixing bad data is the second thing that can go wrong, which is errors in your app. If your app raises an exception that is unhandled, that is a bug in your app, and you should email the team, the whole department, the universe, I don't know, but the point is, an email from your app about this should mean: we've got to fix this, we've got to stop what we're doing and fix this problem right now, because the app has a bug and it's preventing the user from doing something that probably is important to your business. Now, let's be realistic: not every bug is worth fixing, not every problem is significant enough to stop development of new features. So you want to codify that decision by downgrading those errors to warnings, so that they are not filling up your inbox with unactionable bits of information. You want emails from your app to be actionable, and, you know, like I said, red alert.
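A sanity check of the kind described might look like this: scan for a state that should never exist (here, people created over an hour ago with no welcome email recorded) and report the offenders. The data shape is invented for illustration; a real script would query the database and email someone from cron:

```ruby
require "time"

# Flag people who signed up more than an hour ago but never got a
# welcome email -- a state the system should never be in.
def stale_unwelcomed(people, now: Time.now)
  people.select do |p|
    p[:welcome_sent_at].nil? && (now - p[:created_at]) > 3600
  end
end

now = Time.now
people = [
  { id: 1, created_at: now - 7200, welcome_sent_at: nil },  # should never happen
  { id: 2, created_at: now - 7200, welcome_sent_at: now },  # fine
  { id: 3, created_at: now - 60,   welcome_sent_at: nil },  # too new to flag
]

bad = stale_unwelcomed(people, now: now)
# In real life: email the team about each record in `bad`, then fix the
# root cause rather than coding around it.
```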
Finally, when you're getting around to extracting services, right, don't just be dumb about it. I think it's worth spending a little extra time thinking of some failure modes that might occur with the service, and for each of those, ask yourself: how serious is this? If this happens, what would be the state of the system? And how do we fix it? If the answer is "horrible" and "a big pain-in-the-ass process", then you might want to design for that in your API. You know, idempotence is a great way to do it. The system I work on, certain aspects are not idempotent, and I didn't even think about what that even was initially, and now I wish they were, because there's a lot of complexity around preventing these bad things from happening. What I did, once I concluded this, is my service does provide some introspection API, so that the client can figure out what's going on and possibly be a little more self-healing. So this is an easy way to do it if you've already got something in place and idempotence is going to be too complex. Okay, so this is about it: record your history, prevent bad data, fix errors that happen or downgrade them to warnings, and make your services a little bit smarter. So if this sounds like fun, if these problems sound like they're really fun to deal with, or to experience, I don't know how I would answer this question, but I have really enjoyed solving these wacky situations. LivingSocial is hiring, as I'm sure you're aware; come talk to me. I wrote this book; it has nothing to do with anything in this talk, but I'd love it if you bought a copy. And that's it. The slides are online, and that is all I have, so thank you. Great, we have time for a question, and maybe two really quick ones. I really thought I'd use up all the time so I wouldn't have to answer any questions. For the problem of the Resque job where you have to retry it, did you guys consider using after_commit, which is new in Rails?
Yeah, so in a lot of places after_commit does solve that problem pretty well. Unfortunately, some of our apps are on Rails 2, so they don't have that. So that's kind of where this came from. And then all our Rails 3 apps were, like, written as if there were no such thing as after_commit, so we've been learning about after_commit. One more? Okay, that's it. Cool, thank you. Thank you very much.