I'm Rachel Myers. My talk today is called Stop Building Services. I'm going to do a quick intro about myself. I am on the board and a volunteer with an organization called RailsBridge, which holds free workshops for marginalized people around the world; we teach Ruby on Rails. A few years ago we realized that we should split off some of our back office functions, so we formed Bridge Foundry, which supports ClojureBridge and MobileBridge and all the other bridges. So get in touch with me if you want to help out with RailsBridge, or if you have a technology that you want to get into the hands of marginalized people. I can help. Another side project: along with a coworker at GitHub, Jessica Lord, I've gotten really interested in getting great dev environments running on inexpensive hardware like Chromebooks, because it's a thing that can make programming accessible to so many more people. So if you are doing something like this, or you want to start doing something like this, get in touch with me. And my day job is as a Ruby engineer at GitHub, where I work on the new user team. We're focused on improving the platform specifically for people who are joining GitHub while they learn to code.

So my talk today includes Star Wars jokes. I am either happy to tell you or sad to tell you that the whole talk is full of bad Star Wars jokes. Both because of the Star Wars jokes and because I'm saying something that might be controversial, that we should stop building services, I brought tomatoes that you can throw at me if you don't like it. So I'm gonna hand these to our MC. Raise your hand if you need to throw a tomato and he can sort that out for you. Please don't really hand out any tomatoes.

I'm also going to use a lot of Ruby examples, but this isn't just a talk about Ruby. I think this is a talk about how we're making technical decisions. In general, my thesis is going to be that we need to be more thoughtful about how we're building services. We can't just say build more services, build smaller services, forever. At some point we have to start talking about why we have problems.

Just to make sure we're all on the same page: service oriented architecture, or SOA for short, is a style of software architecture that deploys distinct services in place of a single monolithic application. Outside of Ruby, in Java for example, this has a long history, but it's a relatively new thing for Ruby and we're still kind of finding our way. In Ruby, service oriented architecture mostly became popular because Rails apps make it really easy to go fast and build a lot of code at once. That can mean you end up with a large application, and that really can become hard to manage.

So we've kind of told ourselves some mantras that led us to build services, and those are things like: monoliths are big. Yes, fine, monoliths are big. I give you that one, that one's true. Monoliths have a ton of code, and because of that they can be hard to navigate. And it's probably true that in a large application there are parts of your app that are badly factored, and that can make it confusing. Another thing we tell ourselves is that monoliths are hard to change. That might be because you had a bad abstraction when you started; over time you stick with it, you build things around it, and when you want to correct it, now it's hard to change. We tell ourselves that monoliths don't work for big teams.
This means if you're responsible for one slice of the app and someone else keeps making changes to it, there's not a very good way in a monolithic app to enforce that boundary. And just to foreshadow what I want to say: if you separate your repos and you have a separate service, it would be kind of rude to make it impossible for a team member to have commit access on it. That's all I'm gonna say. You probably have an organizational problem if you're doing that.

We also say that monoliths make apps slow. I don't think this one is defensible in the end, but there could be situations where, if you had workers that were dedicated to just a subset of requests, then you could scale up just that section and that might make it faster. So that's something we've told ourselves. Monoliths aren't web-scale. This is just a troll, this is just me trolling. If I had 14 different micro applications I could scale them up so well, and then I would be web-scale. I think that's what this means; people say this.

So these are all things that I'm mocking now. I'm mocking them, but if you had asked me in 2012 what the ideal application architecture was, I would have told you all these things. I was a true believer. So I come and I say these things mocking myself. And all of the examples I'm gonna give today are examples of things that I did that I now look back on and think were bad decisions. So this is not me picking on some hapless person who just stumbled into my view. This is me; I'm confessing my sins.

So before I jump into what I've done badly, I wanna talk about how we should make architecture decisions, because again, I've made architecture decisions and now I regret them. If you had asked me back in 2012 what we should think about when we're making architecture decisions, I would have said any one of these things. I would have given you mantras. And the point I wanna make here is that mantras are not what we need when we're making architecture decisions. We need reasons and we need evidence.

So I wanna introduce the idea of falsifiability. This is an idea that comes from philosophy of science and was specifically put forth by Karl Popper. The idea is that there is some evidence that could cause me to change my mind. If there is evidence that could cause me to change my mind, that idea is falsifiable. And if there's no evidence that could cause me to change my mind, that belief is not falsifiable. And I put forth that all of our architecture decisions should be falsifiable beliefs. We need to be convinceable. We can't just rely on mantras here. I used to have non-falsifiable beliefs, and eventually I changed my mind.

So this is the form of my talk today. I propose that we should make decisions based on evidence. Controversial. I'm glad you laughed. Next I'm going to suggest what we should look for when we look at our evidence, what our criteria should be. Then I'm going to walk through case studies, because I take those as a kicking-off point for looking at evidence. And at the end I'm going to try to pull out some conclusions, things that I think we could use as principles going forward. For example, we should group together in services things that are going to change together, rather than pulling them apart into distinct services. We should think about the differences between libraries and services and be careful when we start mixing those two strategies. But I'll get to that.
So what should we look for when we try to judge our architecture? I think it should meet three criteria. First, whatever we choose for our architecture should make our product more resilient and fault tolerant. It should make it less fragile. We should prefer architectures that let us better withstand and recover from failures, and we should avoid anything that introduces new fragility that isn't absolutely necessary. Secondly, we should prefer architecture that makes working on and improving the product easier. If it's easier to understand and debug and improve a feature in one architecture, that is a better architecture. And lastly, our architecture impacts our ability to work together on teams. I'm sure we've all heard of Conway's Law: the idea that the structure of an organization can be mirrored in the software design that's produced by that organization. As an example, if you have a lead architect and all of the other teams take their orders from the lead architect, you might get a main app that doesn't do very much but talks to a lot of services, and the services themselves don't talk to each other. Or within a single app, you might have three different teams that own different verticals, and maybe only two of them need to talk to each other, so you might get something that looks like that. So it's important to keep these things in mind as we go through. I'm gonna put this nice border around this slide so you can remember it. Those are the criteria that I think we need to look for, and that's what I'm going to pay attention to as I go through these case studies.

And now, the case studies: the explanation of all the ways I have messed things up. Before I get to that, I need to clarify: I'm an engineer at GitHub, and these are not stories from GitHub. On the whole, most of the experience that you see when you go to github.com is a single monolithic Rails app. We do have some services, like our metrics collection service, but most of our app, even things that aren't core functionality, lives in our main app. That's things like notifications, the framework for A/B testing, and audit logging; that all lives in the main app.

So the first case study is what I think a lot of people would consider the perfect use case for services. A team I was on got a feature request for a neglected part of the website where the code was much more tangled than in other parts of the app. This was a feature that allowed people to vote on things that they thought should be sold on the site. This is a really common feature for e-commerce sites, because people think if they voted for it, they'll buy it. That's not always true, but it's convincing.

So here's what happened when we tried this. The first thing you should know is that our JavaScript was, bluntly, a hot trash fire. It's really true. It mixed behavior, presentation, and obscure browser fixes, sometimes all in one line, and there were hundreds of lines like this. And remember, this is a feature that is fairly simple. It's voting yes or no. That's all it does. So there's no reason that it needs to be like this. Sometimes we didn't understand if the browser fixes were even for browsers that we still supported. There was never any indication of the intention behind a line, and there was absolutely no structure. So we ended up deciding that this JavaScript was unrefactorable, which is a fascinating concept in retrospect. That's what we decided.
And the server-side code had a different problem. I've recreated some code to describe what the model looked like; I've created a hat domain model. So there are hats that could be voted on, which were not yet products on the main site, and there are the hats that you could actually buy, and the behavior of the two is entirely different. And there's this interesting thing where we created an association between two things that were the same type of object but had two entirely separate behaviors. So this is a pretty confusing data model. If we had been following the advice for managing complexity that Sandi Metz would put forth, for example, we would have decided that we were violating the single responsibility principle, and we would have refactored these into two separate classes: one that can be purchased and one that can be voted on. But we thought that building services was an alternative approach to managing complexity, and we thought that we'd be able to refactor as we rewrote the services. So we didn't bother to refactor this in place.

So here were our project goals. And by the way, this is Icarus, the boy who builds wings out of wax and tries to fly to the sun. This is foreshadowing, huh? Okay, I got the name wrong. This is meant to show you what is going to happen: we're going to fall to the ground. So we needed to make a small improvement to the feature, and we saw this as an opportunity to refactor and to manage our complexity.

So here's what we did. We started with our main DB and our main app. Then we created a service and gave it a new database. And then we did one bad thing, just one bad thing: we connected to our old database so that we could get information about the existing hats. This should have been a warning sign for us, but it wasn't. Then we did another bad thing. We realized that the main app also needed to know some attributes about voteable hats, and we couldn't wait until the API was built, so the main app connected to the voteable hats database. And we should pause here to appreciate that this is the diagram of doom. If you end up here, if you ever draw this, stop what you're doing. Don't do this, never do this. This diagram means that we didn't understand the code we were extracting well enough to actually extract it, well enough to see the boundaries. So we drew the lines around our new service poorly. Instead of creating a cleanly extracted service, we created an ecosystem of services that mirrored the poorly factored code in our original app.

Then we did another bad thing. Realizing there was so much complexity in the hat model and its associations, and not being willing to spend time understanding that complexity, we packaged the hat model and associated models in a gem, and then we included that gem in our main app and in our service. I swear, I make better decisions now. I'm just telling you about it anyway, it's fine.

So the first failure here is that we drew the wrong lines around our services. And this is a huge danger that I think is not discussed when people are talking about services. Creating classes with well-defined responsibilities is a precondition to ever trying to build a service. And more importantly, services are not an alternative to managing your code complexity. Teams that are driven to write services because they have gnarled code in their current app are not going to succeed.
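Just to make the hat problem concrete, here's a rough sketch of the shape it had, and of the single responsibility refactor we should have done first. All of the class, column, and method names here are invented for illustration; this is not the real code.

    # Before: one model trying to be two things at once.
    class Hat < ActiveRecord::Base
      # The odd self-association between hats for sale and hats up for a vote.
      belongs_to :voteable_hat, class_name: "Hat"
      has_many :votes
      has_many :orders

      # `purchasable` is assumed to be a boolean column separating hats that
      # are for sale from hats that are only up for a vote.
      def vote!(user)
        raise "already for sale" if purchasable?
        votes.create!(user: user)
      end

      def purchase!(user)
        raise "not for sale yet" unless purchasable?
        orders.create!(user: user)
      end
    end

    # After: two classes, each with a single responsibility.
    class VoteableHat < ActiveRecord::Base
      has_many :votes

      def vote!(user)
        votes.create!(user: user)
      end
    end

    class PurchasableHat < ActiveRecord::Base
      has_many :orders

      def purchase!(user)
        orders.create!(user: user)
      end
    end

Two boring little classes like that are what you want to have in hand before you even think about drawing a service boundary around either of them.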
So if you have unmanaged complexity, the thing to do is to refactor it, to work on it until it's something that you understand and have a handle on. And then you should see if you still want to make a service. And just to drive this point home: we locked in our bad code with architecture. I drew those databases, by the way, with Keynote drawing tools. Do you like them?

Second, and it took me a while to realize this, but now it seems quite clear to me: services and libraries are meant to do two different things. A library is a way of caching behavior in your app, and a service is a way of extracting behavior out of your app and making yourself not responsible for it. So if you find yourself extracting services and then sharing libraries between them, when I see that now, I immediately know that something has gone wrong. We drew the boundaries incorrectly, or we haven't fully extracted the behavior, or there's some core functionality that we think we need everywhere, which sounds to me like not the right boundaries.

So coming back to this list of things that I want from architecture: this project was really focused on helping us improve the code, number two. And I think we failed. We didn't make it easier to understand and improve our code in the future, or create a more nimble application. We made it worse. It became harder to understand the code and to change it in the future.

To talk about how this is handled today: GitHub is a giant monolithic Rails app, and there are corners of it that are neglected and poorly understood and untested or poorly tested. That makes it very risky to refactor, but it also makes it absolutely essential. So to help with refactors and rewrites like those, we maintain an open source project called Scientist. Scientist lets you define an old code path and a new code path and run both in production. It reports back when they disagree, and you can investigate those disagreements and resolve them. Maybe the old code path is right and you found an edge case that you need to account for in your new code path, or maybe it's a bug in your old code, and you'd be surprised how often you find those. So you get to dark-ship your code and test it against real users without impacting the real users. If you want more information about this, my co-worker Jesse Toth gave a really great talk called Easy Rewrites with Ruby and Science. In her case, it's an explanation of how she rewrote all of the permissions code at GitHub. She went through every corner case that she was getting for months, and in the end she shipped code that was tested and reliable and well factored, which was the exact opposite of what it was before she started. It was a very impressive rewrite of the kind that is almost impossible to do well. So it's a great talk.

So, second case study. We kept building on these services, and we decided to build out a service that would manage all of our authentication and authorization, because we imagined ourselves building lots of applications, and everyone would need to authenticate. This seemed like a natural first step. It was also enticing because our complexity was concentrated in our God model, the thing that our app is about, and in our user model, and this would pull out all of that user code. So it sounded enticing. I should say that it was about pulling out the complexity of our user object, but not all of the complexity of our user object was in the user model, which is part of the problem.
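Coming back to Scientist for a second: a minimal experiment looks roughly like this. The permission-checking class and method names are invented for illustration, and in a real setup you would also subclass Scientist::Experiment and implement publish so mismatches land in your metrics or error tracker.

    require "scientist"

    class HatPermissions
      include Scientist

      def allowed?(user, hat)
        science "hat-permissions" do |experiment|
          experiment.use { legacy_allowed?(user, hat) }           # old code path (control)
          experiment.try { Hat::Policy.new(user, hat).allowed? }  # new code path (candidate)
        end
        # `science` always returns the control's result, so users keep seeing
        # the old behavior while the new path gets exercised in production.
      end
    end

Okay, back to the identity service.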
The user object was actually modified in a lot of places. So we also had a second problem: we were at a size where we were starting to have problems with team ownership. First, teams that consumed the API or had feature requests for the API would sometimes just make their own changes in the API without consulting the teams that were responsible for it. And secondly, we were seeing more and more people making proposals for code that they weren't responsible for and then zooming off. I now call this the swoop and poop. Like a seagull. It's not cool, don't do it. So this was especially problematic for the team that was maintaining the users, and extracting that code seemed like it would give us the complexity management and the organizational separation that we really wanted. If I look back at what I want from architecture, this was focused on numbers two and three: encourage the understanding, debugging, and changing of code in the future, and help teams work together better.

So here's what we did. We learned from our past mistakes and we didn't create a new database. We pointed our identity service at our existing main database, and when a request came in, it would go through the main application, which would call the identity API, which would hit the database, and then it would return a response. So it's a little circuitous, but it's not the worst thing in the world. And when we first deployed it and it started working, it felt great. It was working. We were running services. We were living the dream. It was like this. I'm way more entertained by my jokes than you are. It's fine.

Around the time that we were creating our identity service, we also launched an iOS app. And the iOS app would need to authenticate users just like the main app did. So you might hope that authentication would work like this: a request would come in, the iOS app would call the identity service, which would hit the main database and return a response. But again, the user objects were modified in places across the application, and we were never able to fully extract all of the user behavior into the identity service. So as a result, the identity service needed to call back to the main application for every authorization or authentication call, including those from the mobile app.

This is a failure. We created the identity service, we ran the identity service, but it couldn't do anything on its own. And we accidentally created something that was much more dangerous. First, we brought on ourselves a lot of non-trivial operational complexity. Now we had one more app that we needed to scale up, and it was called frequently; you have to know things about the users on almost every request. So we added a lot of database hits, too. But more seriously, the identity service wasn't capable of functioning without the main application, and the main application couldn't function without the identity service. So both applications needed to be up all the time for the site to be usable. Instead of one single point of failure, our main app, we now had two. A problem in either app would bring us down. It would devastate us.

If you can indulge me in a little more Star Wars humor: it's like the droid command ship. Do you remember how the droid command ship goes down? There are all these fighters trying to hit the droid command ship and it has its shields up.
But then Anakin crash-lands on it somehow, because R2-D2 is with him and he saves the day every time. So he crash-lands inside the droid command ship, he accidentally fires the ship's guns as he's trying to take off, and he takes out the main reactor of the ship, and all the droids stop. They're like, whoa, whoa, whoa, and they just all die, and everyone's like, okay, we won. That is what we created here. We created this giant single point of failure in our app. Okay, I won't go off script again, I'll stay on script.

So the identity service wasn't relieving the load on our main app in any meaningful way, because it was still hitting the main database, and we had more network calls than we had before. It's common to hear SOA touted as a way to reduce app load and, in so doing, to help scale applications. But SOA and scaling is a subtle subject. This is how I learned about queuing theory. To give you an overview: starting out, we naively believed that anything that would reduce load on the main app would help scalability and our server response times. For example, if we had 1,000 workers in the main app and we added 1,000 more workers to handle just authentication, that seems like it would be an improvement. But the identity service could still create a bottleneck for us, because it's like a coffee shop where you have to stand in one line to order your coffee, then another line to pay for your coffee, then another line to pick up your coffee; with the monolith, I just have to stand in one line. It really is possible to scale up that multi-line coffee shop so you get your coffee just as fast as with one line, but it often requires hiring a lot of people and really optimizing the flow. It's not as simple as just creating a new service, adding workers, done. It usually is not that simple. So in cases where there's no real difference in priority between requests, it's often faster to have just one large pool of workers that can respond to any request. And of course that changes if you have different uptime requirements. Additionally, we had to consider the time for network calls. In our case, we had to go back and forth between the identity service and the main app quite a bit, so we had a lot of network calls. And in the case of the identity service, for the same amount of resources, we could get a faster average response if we had a single pool of workers. That calculation definitely changes for services that are used less frequently, or that could time out without devastating the user experience.

So in terms of our architecture standards, we made the experience more fragile and less resilient. We had two single points of failure and added operational complexity. And then we had our other goal of defining clear team responsibilities. I think the result here was an attempt, like we got started, but it wasn't a clear win. It didn't solve our core problem of teams not respecting each other's responsibilities. And I don't think that we should rely totally on software to do that; I think a lot of it is an expectation that you have between people. I can't say that we at GitHub do this perfectly, but I'll show you our solution to this problem. Every controller class sets a team that is responsible for it, and those values are pushed along with the errors into our error tracker.
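The mechanics look something like this. It's a simplified sketch with invented names, not GitHub's actual implementation, and ErrorTracker stands in for whatever error reporting service you use: each controller declares the team that owns it, and errors get reported with that tag.

    class ApplicationController < ActionController::Base
      # Which team owns this controller; subclasses set their own value.
      class_attribute :responsible_team
      self.responsible_team = "unowned"

      around_action :tag_errors_with_owner

      private

      def tag_errors_with_owner
        yield
      rescue StandardError => error
        # Report the error tagged with the owning team, then re-raise so
        # normal error handling still happens.
        ErrorTracker.report(error, team: responsible_team)
        raise
      end
    end

    class HatVotesController < ApplicationController
      self.responsible_team = "hat-voting"
    end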
So in my error tracker, when I view it, I can see all the errors that I'm responsible for. And because I'm going to see those errors and be responsible for them, that also creates the expectation that if I make changes in a part of the app that I'm not responsible for, or if someone makes a change in an area I am responsible for, we should get each other's sign-off before we merge those changes.

So, case study three. This one was not destined to be a disaster; the implementation made it one. When we had a monolith, we needed one set of assets for the website. Soon we had a mobile web experience and lots of services, and we had a few native apps. Maybe now they have a watch app, I don't know. So we needed to find a way to share all of our assets across all of these applications rather than repeat them. These are things like the header, which was visible everywhere. We had a login modal. If you do e-commerce, it could be size charts, anything customer-facing that you want to share. So we had one main goal: we wanted to avoid repeating HTML, CSS, JS, and images across all of our apps and services. We at least wanted to make them available for all of those things to use as they saw fit. We could have done this by copying and pasting, but that doesn't avoid repeating, that is repeating, and we thought we could be more clever than that. We thought of the handy, cheap solution that I think every Rubyist jumps to: we could make a gem, and then each of our apps could just fetch that gem and pull it into its own code. I'm using a gem here to indicate a library, but I think the exact same thing applies to any libraries in your app. If you use Bower packages or npm packages, the same principle still applies.

So here's how it went down. Each app would include the gem and pull in all of those assets, and this was very cheap, because it's very easy to do and it's easy to reference all those assets in your code. The trouble came when the assets needed to change. So this is just a generic gem, prime or whatever. When the assets change, those are sometimes dramatic customer-facing changes, and in those cases we want to update all of the assets everywhere. That's not how gems work. Imagine that each of these services has a UI component associated with it, and we increment our gem version. We then needed to coordinate updating all of the applications and somehow, magically, redeploying all of those apps at the same time. That's not how deploys work.

So this was a really interesting failure. We drew slightly better boundaries around things, but we still got it wrong. We didn't draw quite the right boundaries, and we still came to really regret this architecture. Sounds good. So what went wrong here? This became a huge pain point because our deployment workflow made it hard to make these changes, and we forgot an important business requirement: that the assets need to change uniformly. The solution would have been okay if the boundaries had been drawn differently for the services. For example, if the assets for a single user experience all came from the same place, I think this would have worked. If all of the assets for the entire web experience had been in one service, that would have worked. But because these apps needed to move together and needed to change together, it made it harder to change our assets at all, eventually. We had to do a lot more coordination up front.
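To make the coordination problem concrete, here's roughly what it looked like; the gem name and version numbers are invented.

    # Gemfile in the main app, the mobile web app, and every service
    # that renders the shared header:
    gem "shared_ui_assets", "~> 2.3"

    # Shipping a customer-facing header change means:
    #   1. release shared_ui_assets 2.4
    #   2. bump the Gemfile in every consuming app
    #   3. bundle, test, and redeploy all of those apps at the same time
    # Gems handle steps 1 and 2; nothing about them helps with step 3,
    # and that's not how deploys work.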
And finally, when people joined the team, they weren't able to just jump in and contribute immediately. In a monolith, you always have the ability to search through a project and find something. Once you start creating services, you have to start doing a lot more documentation. We had a lot of people who would find the header and start changing it, but it was vendored, right? They were changing this one copy of the header, and we weren't going to re-bundle it when we deployed; it was coming from the gem, so that didn't work for them. So it was an important lesson that we needed to rely on documentation instead of just trusting people to search and replace things.

Going back to our standards for architecture: this project was really about code quality, about making it easier to update assets rather than copying them, so the second requirement. And I would say that although we did avoid having to repeat ourselves, it came at a bigger price than we expected. We needed to really ramp up our documentation game, and because deploying changes was harder, we ended up finding it hard to change anything at all. Also, no one told me that when I started building services. I would just like to say to all the people who wrote blog posts back in the day: you didn't say that. So although this wasn't really terrible, and it wasn't a dangerous change like the identity service, we didn't really get what we wanted, and I would consider this a failure.

So I would like to indulge in going through what some alternatives would have been for this one. In retrospect, I think the point about services and libraries doing different things applies here again. We could have built something that would change uniformly. For example, we could have had an asset service instead of a gem, so the assets aren't cached in each service; each app calls out to the asset service for all its assets. That would have helped us update the assets, but it wouldn't have helped us deploy those changes, so it would have been an incremental improvement. I think an even better solution would have been to create a front end that contains all of our assets, and then a back end, and that would also have helped us with our native apps. So I just worked my way around to a very standard architecture. There you go.

Also, monolithic apps don't have this problem, but they can have similar problems; you can architect your monolith in such a way that you run into similar things. At GitHub, we package up our assets and we use them across several applications, and we do run into pain points around versioning and incrementing those, but this is crucial: it's not a customer-facing problem. The entire customer-facing experience is served from a single monolithic Rails app. All of the other apps that include these packages are internal apps, and we include the core JavaScript and CSS so that everything we have feels GitHubby. But those apps don't need to change at the same time as the customer-facing app.

So, one last use case. This one is about failing gracefully. At this point, I've given a version of this talk before, and people came up to me and said, yeah, it's weird, I don't know why anyone would ever build a service. Which I felt a little bummed by, because that was never my point. So I'm going to present a case where it kind of worked out okay. I was working on a website that was not a social network, and we got a request for social features. So say this is our website.
I had a lot of fun making the slide deck, just saying. This is what we consider essential for our customers to shop, and then we got a feature request for comments on this page. So in this case, they're just commenting with fuzzy JavaScript screenshots. It's fine, it's fine.

The first thing that we wanted to accomplish when we were building this out was that we didn't want to impact the core functionality of our site. This is a bonus feature. It isn't essential to using our website, so it has very different uptime requirements. If the comments go down, or they get slow to the point that we're just going to time out the request, that's fine; our site is still up. And this fits exactly our first requirement for architecture: we want to increase our reliability.

So we built this out as a separate comment service. We could build it as a comment service, or we could build it as part of our monolithic app. If it's part of the main app, then we need to make sure that we Ajax the request, or that we create the content after the page loads, or paint the content after the page loads. And maybe we should relegate comment creation to a background job, and we should set strict timeouts to make sure that if we're fetching a bunch of comments we don't take up too much time, all those things. Or we could build a comment service and get a lot of that for free. We automatically get the isolation that we want. And if we're only building an API that returns, for example, comment JSON, then it's very independent.

So we ended up building this as a separate service, and it worked okay. I don't have any failures to point to. All I have is grumpiness. My grumpiness is this: there were some details that we had to mind, and the client of the service has more to do here than the actual service does. We had to compromise on when we would repaint the page, because we wanted to avoid that; we ended up doing it below the fold, so most people wouldn't know. But I still don't think there was a pressing need for it to be a separate service. As it is, it's fine, it runs, it functions, but now we have two apps to keep up all the time. It doesn't have to be up all the time, but you still have to always think about it. So why do that? Why? Grumpy Rachel says why. Going back to my architecture requirements: we specifically wanted to focus on fault tolerance, and I think that worked. That being said, why? We could have done it other ways.

So we're almost done; I have some conclusions. I started by proposing that we should make architecture decisions based on evidence. Wild. Next I suggested what we should look for in our evidence. Then I went through some evidence, and now, what do I conclude? First, if you really have to do SOA, you should untangle your code before you try it. If you successfully untangle your code and you still want to do SOA, I feel like you've earned it at that point, but I bet you won't get there. Secondly, don't ignore your points of catastrophic failure; the fewer of those you build, the better your life will be, the happier you will be, the happier your ops team will be, everyone will be happier. This is simplistic, of course, because everyone has a few points of catastrophic failure and it's hard to ferret them out and find them, but the goal really should be to minimize them, and we have to think about this when we're building our services.
And lastly, there will be parts of your app that need to change together, and you can save yourself so much pain if you think beforehand about which things will need to change together, and you pull those into the same service.

And now, a non-Star Wars joke. I think a lot of the impetus for building service-oriented architecture comes from frustration with code quality, but services can make those problems worse. So the first step is to focus on your code quality inside your monolith. And my one message, if you can only have one message, if you don't get to have all my messages, if you get one: take care of your giant robot. Take care of it, it's great.

And so with that: I got some amazing drawings from ExplodingDog.com, which is drawn by Sam Brown, and I went through and drew skirts on all of them so they're cooler. And I got all my Star Wars memes from Memesinner.com, that fine website and institution. And if you agree with me or you disagree with me, please, I'm happy to argue with you civilly or just chat about these things. I'm Rachel Myers. Thanks. Thank you, sir.