I'm Chris Hoffman, and we're gonna talk about services: specifically, how to move from a seemingly intractable monolith to an ecosystem of services that you can operate well and sustainably. I work at Optoro. We're an e-commerce firm headquartered in DC. The only thing that's interesting about us is how we source our inventory: we aggregate retail returns. So if you're a Best Buy or a Target or some other major retailer, returns are annoying. What we do is take that returned merchandise off your hands. We test and grade everything that comes through our warehouse, and we figure out, based on the client's recovery goals, whether we should resell it individually, resell it in bulk, donate it, or recycle it. Then we sell stuff on a variety of online marketplaces: Amazon, Best Buy, eBay, and our own marketplace, Blinq.com. Things ending in Q will happen again. Even after five and a half years, I don't know why we end things in Q. We just do. That's about all you need to know about the business model.
So I've been at Optoro for five and a half years. We still have a monolith, but we're moving more and more towards an ecosystem of services around it. Over that period of time we've done several service projects, and I've been involved in a bunch of them. They weren't all successes. So what I'm gonna do today is talk about three of them. I'm gonna tell you what we were trying to do, so you have enough context, what we did wrong, and what we did right. Then I'm gonna finish the talk by giving you some advice on what to do before you start any service projects, how to approach your first service project, which is different from the ones that'll come after, and how to operate an ecosystem of them without having a bad time.
So I started in 2012, and I walked into a company that was already trying to figure out: how do we break apart our monolith?
These were conversations that were already happening when I got there. We didn't actually give it a go until 2013. We started with auth, because auth's easy, right? So yeah, no, not so much. With any multi-tenant software, auth is never just permissions. It's always more than permissions. In our case, specifically for authorization, to authorize a request we need to know not just what permissions a user has by way of what user group they're in, but also who they're employed by and what warehouse they're assigned to, because our software runs on hardware that we provide in our clients' warehouses. That presents an immediate problem for extracting things to an auth service. It's really easy to see that the user, user group, and permission cluster of models is going to be owned by auth and not by inventory anymore. But then you have this hanging relationship across a network boundary, because auth still needs to know the client and warehouse to authorize a request, and those are core models to inventory. Like, it's an inventory management system. Warehouse is kind of a central concept.
So we were a bit at a loss. And we didn't really know as much about distributed systems in general as we do now. So we didn't really solve this data sharing problem. We just kind of punted on it. Whenever inventory wanted to authorize a request, it would make a request to auth: hey, auth, is this person allowed to do this? And then auth would ask inventory: inventory, where do they work again? Except it had to be a different inventory instance, because in staging we actually completely locked up our application. See, if you block a web request to do a thing that requires making a web request back to the first thing, and you get enough web requests at the same time, suddenly your service doesn't work anymore. So yeah, we learned that for production, kind of. That second inventory then returns the information to auth, and then auth gives inventory a yay or a nay.
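
A minimal sketch of the lock-up described above, treated as a capacity problem. It is a single-threaded model, not the real application: `WorkerPool`, `simulate`, and the pool size of 4 are hypothetical names and numbers chosen for illustration. Each incoming request claims an inventory worker and holds it while waiting on auth, and auth's callback then needs a free worker on that same inventory app.

```ruby
# Hypothetical model of a fixed-size pool of web workers.
class WorkerPool
  def initialize(size)
    @size = size
    @busy = 0
  end

  # Returns true if a free worker was claimed, false if the pool is exhausted.
  def acquire
    return false if @busy >= @size

    @busy += 1
    true
  end

  def release
    @busy -= 1
  end
end

# Simulates `concurrent_requests` arriving at once at an inventory app that
# calls auth, which in turn calls back into the SAME inventory app.
def simulate(pool_size:, concurrent_requests:)
  pool = WorkerPool.new(pool_size)

  # Step 1: each request claims a worker and holds it while it waits on auth.
  waiting_on_auth = concurrent_requests.times.count { pool.acquire }

  # Step 2: auth asks inventory "where do they work?" once per waiting request.
  # Each callback needs a free worker, which is released once it has answered.
  callbacks_answered = waiting_on_auth.times.count do
    claimed = pool.acquire
    pool.release if claimed
    claimed
  end

  {
    waiting_on_auth: waiting_on_auth,
    callbacks_answered: callbacks_answered,
    locked_up: waiting_on_auth.positive? && callbacks_answered.zero?
  }
end

p simulate(pool_size: 4, concurrent_requests: 3)
p simulate(pool_size: 4, concurrent_requests: 4)
```

With three concurrent requests there is still one free worker to answer the callbacks. With four, every worker is waiting on auth, no callback can be answered, and nothing ever finishes. Pointing auth at a second, dedicated inventory instance breaks the cycle, at the cost of the extra network hops.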
The big problem with this, and the reason we eventually stopped doing this project, was that it introduced unacceptable latency into our system. This is just the start of a request, and we've gone from zero network requests to two. And like I said, this is just the start of the request. For the rest of it, you still have to ship the unit or test and grade an item. And we were just not okay with the performance penalty here, even with the dedicated inventory instances. So we canned the project.
So it's all well and good for me to sit up here and tell ghost stories and have animal pictures and diagrams. But if you all aren't actually learning something from it, it's kind of a waste of time. Fortunately, as I mentioned, we messed up lots of stuff, and there's lots of things you can learn so that you'll make different mistakes, because that's the goal.
The first takeaway, and the thing we noticed immediately when trying to pull data out of a monolith, is that callbacks are not your friend. Their kind of raison d'être is to hide code from you. If they're hiding code that only changes that model, I don't like it, but maybe it's okay in your applications. But ActiveRecord makes it really easy to touch related models; like, that's its job. And if your callbacks, which are hidden from you, are touching related models, and you have to try to extract those related models from your database, that gets really frustrating, like hair-pulling frustrating, very quickly. After this project at Optoro, we basically don't write Rails callbacks anymore. We have other patterns for linearizing persistence actions, and we try to get rid of the callbacks we've still got whenever we can. So I highly recommend that you do this. The one thing I will say in defense of callbacks is that they do transaction management for you, which is a good thing.
With the exception of the callbacks that run after the transaction, all the callbacks related to a particular save will happen in a transaction, so they will either all happen or none of them will happen. You won't have some of them happening and some of them not happening. This is actually really good behavior, and you should, one, get rid of callbacks and, two, recreate that behavior yourselves. I'm not going to say that you should do this or you're a bad developer if you're not, because I really don't like people getting up on stage and saying that, especially white men, because it generally comes from a place of toxic gatekeeping. So I won't. But I will say that understanding transactions and transaction isolation isn't going to get less important the more services you have, and the better you understand these things, the better a time you will have when you get more services.
But even if you have gotten rid of all your callbacks, data sharing is still hard. How do you figure out where shared models go? Which of the services that need a model owns that model? Like I said, we kind of punted on that and had to spin up a second monolith just for fun. There are some strategies you can use. You can think about sharing a database, so you get two apps writing to one database. The problem with this is that no matter which app you are, the other app is evil and doesn't share your values, i.e. your validations, and will corrupt your data out from under you. So that's a bad idea. There's a lot of other strategies; I'm not going to go into them. But there's one thing that we should have done, and we didn't realize this until like years later, after we started learning more about distributed systems in general. And you are building a distributed system when you are embarking on a service project.
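
One way to picture the earlier advice about dropping callbacks while keeping their all-or-nothing behavior is a plain object that lists the persistence steps in order inside one explicit transaction. This is only a sketch: `FakeDB` and `ReceiveUnit` are hypothetical, and `FakeDB#transaction` is an in-memory stand-in so the example runs without Rails. In a Rails app the wrapper would be `ActiveRecord::Base.transaction`.

```ruby
# Hypothetical in-memory stand-in for a database, only so the sketch runs
# without Rails. In a Rails app you would use ActiveRecord::Base.transaction.
class FakeDB
  attr_reader :rows

  def initialize
    @rows = []
  end

  # Mimics all-or-nothing behavior: if the block raises, every write made
  # inside it is discarded and the error is re-raised.
  def transaction
    snapshot = @rows.dup
    yield
  rescue StandardError
    @rows = snapshot
    raise
  end

  def insert(row)
    @rows << row
  end
end

# The steps a set of callbacks might have hidden, written out in order.
class ReceiveUnit
  def initialize(db)
    @db = db
  end

  def call(sku:, warehouse:)
    @db.transaction do
      @db.insert({ table: :units, sku: sku, warehouse: warehouse })
      raise ArgumentError, "unknown warehouse" if warehouse.nil?

      @db.insert({ table: :audit_log, event: "unit_received", sku: sku })
    end
  end
end

db = FakeDB.new
ReceiveUnit.new(db).call(sku: "A1", warehouse: "DC-1")

begin
  ReceiveUnit.new(db).call(sku: "B2", warehouse: nil)
rescue ArgumentError
  # The failed call left no partial writes behind.
end

puts db.rows.size
```

The second call fails after its first insert, and that insert is rolled back along with everything else in the block, so only the two rows from the first call remain.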
We don't generally think of Ruby as having a big presence in distributed systems, but that's what you're building. So this is what we should have done, with the benefit of hindsight. I said warehouse and client didn't belong in auth. Yeah, so that's a lie. They pretty much did. The right move here is to have auth own warehouse and client. The way that would work is that auth and inventory are their own apps, they've got their own databases, and auth now services requests for client creation, warehouse creation, and user creation. That actually creates a nice UX and request flow, because you're creating the client, their warehouses, and their users all in one place. So that's a nice hierarchical setup.
But inventory still needs that information, because it has to service requests like: I'm gonna scan this unit in. If inventory goes, that warehouse is not a place, I don't know what's going on, 404, that's unacceptable for our users. So what we need is for auth to tell inventory every time it gets a new warehouse or a new client, any time a warehouse or a client is changed in a certain way, and any time warehouses or clients are removed. That one's especially important: we want inventory to freak out if people who are no longer on contract are still using our software. So auth says: hey, inventory, we signed these new clients, they've got these warehouses, don't freak out when they try to scan units in at those locations. Inventory says: thanks, 200 OK, I really appreciate you looking out for me like that.
You might think of this as data replication, or as similar to caching. I tend not to like those words for this application, because they imply a degree of sameness, and the data doesn't need to be the same in all these places.
Basically, if you were here for the previous talk, this is pretty much the most primitive example of event sourcing you can get. But the data doesn't need to be the same. Take another example, shipping and accounting. Both shipping and accounting need to understand the concept of a unit, but they have very different requirements for what that means. Accounting doesn't need to know how much a unit weighs, because it doesn't care, and shipping doesn't need to know how much we charge the client for the unit, because it doesn't care. But they both need a concept of unit.
So the general pattern we use now is this: for every model that we wanna share between multiple services, we pick a service to be the origin service for that model. Usually it's the one at the start of the life cycle, and you'll know which one that is because you understand the life cycle of your models. In our auth case, auth is the origin service for users and, more importantly in our example, for warehouses and clients. What that means is that any other service that's not the origin service, we'll call them downstream, can have warehouses tables, but all of those tables have to have an origin ID column that is essentially a foreign key constraint across the network to the ID column in the auth service's warehouses table. The origin service for a particular model is the only one that doesn't have to have an origin ID. Put a different way, it's the only one that gets to make globally, system-wide new records. Every other service gets to make locally new records, but they're a reflection of, or a view of, or a representation of records you've already got in the system as a whole. And they won't always contain only data that was present in the originating system. In our shipping example, shipping's gonna make a call out to FedEx or UPS or whatever shipping provider we have, and it's gonna write down a weight in the database, which will not have been present when the unit was received. And that's fine.
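
The origin and downstream pattern described above can be sketched in plain Ruby. Everything here is an assumption made for illustration: the class names, the event shapes, and especially the in-process `handle` call, which stands in for what the talk describes as a network request from auth to inventory. The point is only that the downstream copy is keyed by `origin_id` and is created or removed solely in response to what the origin service announces.

```ruby
# Origin service: the only place globally new warehouses get created.
class AuthService
  def initialize
    @warehouses = {}
    @next_id = 1
    @downstream = []
  end

  def register_downstream(service)
    @downstream << service
  end

  def create_warehouse(name:)
    id = @next_id
    @next_id += 1
    @warehouses[id] = { id: id, name: name }
    broadcast({ type: :warehouse_created, id: id, name: name })
    id
  end

  def remove_warehouse(id)
    @warehouses.delete(id)
    broadcast({ type: :warehouse_removed, id: id })
  end

  private

  # Stand-in for telling each dependent service over the network.
  def broadcast(event)
    @downstream.each { |service| service.handle(event) }
  end
end

# Downstream service: keeps its own local records, keyed by origin_id.
class InventoryService
  def initialize
    @warehouses = {}
  end

  def handle(event)
    case event[:type]
    when :warehouse_created
      # Local record; `units` is data the origin service never had.
      @warehouses[event[:id]] = { origin_id: event[:id], name: event[:name], units: [] }
    when :warehouse_removed
      @warehouses.delete(event[:id])
    end
  end

  def scan_unit(sku:, warehouse_origin_id:)
    warehouse = @warehouses[warehouse_origin_id]
    return :not_found unless warehouse

    warehouse[:units] << sku
    :ok
  end
end

auth = AuthService.new
inventory = InventoryService.new
auth.register_downstream(inventory)

warehouse_id = auth.create_warehouse(name: "DC-1")
p inventory.scan_unit(sku: "LAPTOP-1", warehouse_origin_id: warehouse_id)

auth.remove_warehouse(warehouse_id)
p inventory.scan_unit(sku: "LAPTOP-2", warehouse_origin_id: warehouse_id)
```

The first scan succeeds because inventory was told about the warehouse. The second is refused because inventory was told the warehouse went away, which is the "freak out if they're no longer on contract" behavior from the talk.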
So the takeaway there is: if you need to share data, broadcast it. Don't try to sync it, and don't try to share a database. Tell your dependent services about what's happening. So we did fail. We learned a lot of things, like two or three years down the line, but that service did not make it into production. One of the other lessons we took from that was that it was too big. We were too ambitious, and the whale killed us.
So the thing we did the very next year was try to go as small as possible. We said, what can we do that's very small? And we looked at product photo processing: uploading, resizing, and cropping of images. If you're thinking that that sounds like something a background worker can do, just hold that thought.
We had this other technical difficulty that was present in the auth service as well, but it didn't make pedagogical sense to talk about it there, so I'm talking about it here. You're all familiar with HTTP. It's a really great protocol. You've got a client, you've got a service, the client makes a request to the service, and the service gives you a response, and everything's great. We didn't wanna do this. This is old and busted. We're a startup disrupting things that have no business being disrupted. So we had this vision, and there are reasons for it, but they're not amazing, of the way services were gonna work at Optoro, five-year-plan type stuff. Instead of using a decentralized synchronous protocol, which is what HTTP is, we were gonna use a centralized asynchronous protocol. There's a couple of reasons we thought this. I won't go into them in the talk. If you wanna ask me about them later, I'll be happy to tell you about them. It will cost you a drink. But we wrote a Rack server that spoke AMQP so that we could use our current Rails apps and just stick them onto this protocol.
We also wrote a gem called AMQParty which, as you can guess, mimics the semantics and interface of HTTParty, and here's what it did. So we've got a client and a service. Because the client and service can't talk to each other directly, the service has a request channel that it's gonna read inbound messages from. And because the service can't respond to the client directly, the client has to have a paired response channel that it's gonna read inbound messages from. It's great. So AMQParty publishes a request message (notice: not sends a request, publishes a request message; there's dangerous territory right here) to the service's request channel, which our Rack AMQP server is polling. The server then invokes your Rails application, gets the response, and publishes a response message to the paired client-service response channel. And AMQParty, as soon as it has published the request message, spins up a polling loop on that paired channel and gets your message.
So this is a really complicated technical diagram. I'm gonna show you one that's much simpler and conveys the same information. Even with all this nonsense, we did it. We made a service. Except not really, because it still used inventory's database, so you can't really call it a service. And as previously mentioned, those of you who were like, that sounds like a background worker: you were correct. We built a lot of dumb infrastructure for a background worker.
So the thing we learned is that adding is easier than extracting, and it would have been even easier if we hadn't done all that nonsense. Like I already said, data sharing is really hard. In fact, it's one of the two hardest problems you're going to encounter when doing service ecosystem work. We'll come to the second hardest problem later in the talk.
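
For a sense of how much machinery the paired-channel flow described above needs just to imitate one request and one response, here is a rough in-process sketch. It is not the AMQParty or Rack AMQP code: Ruby's built-in `Queue` stands in for the broker's channels, and `Channels`, `start_service`, and `request` are hypothetical names.

```ruby
# In-process stand-ins for the two broker channels (hypothetical; a real setup
# would use queues on an AMQP broker rather than Ruby's Queue).
Channels = Struct.new(:requests, :responses)

# Service side: read request messages, handle them, publish response messages.
def start_service(channels)
  Thread.new do
    loop do
      message = channels.requests.pop # blocks until a message arrives
      break if message == :shutdown

      channels.responses << { status: 200, body: "resized #{message[:path]}" }
    end
  end
end

# Client side: publish the request, then poll the paired response channel.
def request(channels, path, timeout: 1.0)
  channels.requests << { path: path }
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout

  loop do
    begin
      return channels.responses.pop(true) # non-blocking; raises when empty
    rescue ThreadError
      if Process.clock_gettime(Process::CLOCK_MONOTONIC) > deadline
        raise "no response within #{timeout}s"
      end

      sleep 0.01
    end
  end
end

channels = Channels.new(Queue.new, Queue.new)
service = start_service(channels)
response = request(channels, "/photos/42.jpg")
channels.requests << :shutdown
service.join

puts response[:body]
```

The caller still blocks until the answer comes back, so the end result behaves like a synchronous call, only with two queues, a polling loop, and a timeout to get wrong along the way.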
So for your first service, you should probably do something that involves as little data extraction as possible. You should add functionality to your monolith in the form of a service instead of extracting data or functionality. Our problem was that it was so small that no one cared very much or even knew about it. I had to be reminded like three weeks ago that this was a thing we ever did, actually.
And the problem with this is that if you are thinking about having a service ecosystem, or microservices as they're mostly called, that is an inherently political project. You are reshaping the way engineering will be done at your organization, and therefore you have to understand and deal with political realities. One of them is this concept of a win. You can have an engineering success, but if it's not a win, because it doesn't make someone's life better, and it's not a visible win, you're probably not going to have enough political capital. When you say, I want to do the next one, wasn't the first one so great? And they say, what first one? We don't remember seeing this. Then you're probably not going to be allowed to continue. So for your first service, I would definitely recommend that you start with something that will have impact, that will make people's lives better. I don't know if the concept of brand is super applicable, but something with impact that people will notice.
And just do the simplest thing that could possibly work. If you have service-to-service communication, just use HTTP. HTTP is the best HTTP that I know of. I would not recommend non-HTTP HTTP replacements. You want to be like Drake in your approach to software architecture: you want to shun and be afraid of reproducing synchronous protocols with asynchronous ones, and you want to think that just using the synchronous protocol is the best thing you could do. The man's very, very wise about software architecture. I think it's the jacket, actually. I really want that jacket.
So there was one service project that we did after that, but it was aborted because a wild deadline appeared and we weren't able to make it a service. The next actual service project that we could do was bulk sales pricing, in 2015. We've always done bulk sales at Optoro, but it's always been a two-to-three-person team who works in the warehouse and is aware of fluctuations in our inventory and of what the landscape is for buyers and sellers of this stuff. They basically call people up and say, hey, so I got a thousand laptops of varying qualities, what'll you give me for them? This is really profitable for us, and it's also very much in demand. People would call us back like half a year later and be like, do you have any more laptops? I'm making this sound much seedier than it is. Like, zero percent of our inventory has ever been sourced off the back of a truck. I'm serious, it hasn't.
So in late 2015 we launched Bulq.com. There's the Qs again. I don't get it. This one at least makes sense. I don't know what Blinq is. Bulq is obviously like, yeah, bulk sales, good. Maybe the next product will have a word that describes the thing and is also spelled correctly. Who could tell? So before, it was just humans doing the pricing, and that doesn't scale up to the scale of a marketplace. We could add more humans, but the humans that we had
don't scale. So it was my job to work with our data science team. They had a pricing model that they were using, based on historical bulk pricing data, and it was my job to make that accessible to our Ruby code. My lead and I got to talking, and we were like, do we need this data in the monolith? And the answer was no, we didn't. So that was the perfect opportunity for another service.
The problem was that when we had done auth and when we had done photo processing, our infrastructure team (yes, we have an infrastructure team) had done the provisioning and configuration and setup and creating backups and logging and metrics and all that stuff, for Rabbit, for our janky-ass Rack server, for the workers, for everything. And this time they were busy doing Bulq.com, because that was public facing, and that's what they should have been doing. So I said, I can do it. I had been talking to them a little bit, and they didn't have a checklist, but the standard procedure was: we use Chef cookbooks for configuration, we use Terraform for provisioning, we use data bags for secrets. And that seemed to me like, yeah, it's Ruby files and JSON, how hard can it be?
So yeah. We had a deadline of nine weeks, when we had to be ready for regression testing for Bulq.com, and we made it. It took me seven days to write the service. It was a really easy service, and there was a redesign in the middle; I could have done it in three if there hadn't been. And it took me seven weeks to learn a thimble-sized amount of each of those tools. And if you'll notice, the last thing has an ampersand on it, indicating that this goes off the page. It does. The problem with this is that one of the motivating factors for the overall service project at Optoro is that we want to go from a developer saying, that piece of functionality doesn't go there, to having a deployable prototype in as little time as possible, and seven weeks is no one's idea of as
little time as possible. The reason for that was that there were too many things to learn. However, we also had the solution at the same time. The reason it took me seven weeks to learn all these things and not like two years is that we had an infrastructure team, and they had already figured out patterns for all this stuff. I didn't have to look at the various common logging formats and figure out which one was a fit for us. They had already done that: we used Logstash. I didn't have to research all the popular health check frameworks and figure out which one was gonna be the best fit for us. They had already done that: we used Nagios. And we as Rails developers understand that these kinds of conventions are one of the most empowering things that you can have. Like, this is what makes Rails Rails.
The other thing that we noticed, and this is gonna sound a little patronizing at first and it's not meant that way, is that I was able to operate my application. Previously we had relied on our infrastructure team, not for code-level or business logic bugs, but for database performance stuff, for tuning, for that kind of thing. And they were busy doing Bulq.com. There were a couple of times where I was actually able to respond to incidents in production using their patterns but just running them myself, which was very empowering. We came away with the conclusion that if we're gonna have more services, developers should operate the applications they deploy. They should be responsible for the production behavior of the code they put into production.
The reasons for that are twofold. If you don't have an infrastructure team, there's only one reason. But if you do have an infrastructure team, the first is that they're already operating your infrastructure, whether it's databases or caches or Kafka or Rabbit or whatever. They're already doing their job. It's not really okay to ask them to also operate all the services you're about to start making, and
the second thing, and this applies even if you don't have an infrastructure team, is that our role as engineers is to deliver value to users through products. That's it, that's the whole job. And if that's the case, and I'm arguing that it is, then our job can't stop at merging to master and deploying. It has to extend beyond that into an understanding of how our systems operate in production and how our users encounter the code we deploy.
And if you're going to do that, you should have an answer to all of these questions at your organization. Again, having these conventions is super powerful. Whether you're on a 24/7 on-call rotation and you get the page at 4 a.m., or you're not on an on-call rotation and you get the page at 4 p.m., when you've just finished for the day and are gonna do some PR review and then head home, you don't want to get a page at either of those times. And when you do get one, you don't want to have to figure out where the logs are. You don't want to have to ask a colleague, where are the dashboards for this? Again, especially if it's 4 a.m.
you just want to know. So what our infrastructure team had already figured out for us, providing production conventions, is what you'll want to do for your organizations. And again, as Rails developers, we know the best thing you can do for a convention is to wrap a tool around it. You want to make it so that people can't forget to set up logging the right way, or can't mess up the way metrics are collected. That could be a project skeleton repo that you clone and change all the names on, or a Rails application template that adds like 10 more initializers that set up logging and service discovery and secrets the way your organization handles them. Or you can do what we do and write a custom build and deploy tool that allows you to write a file and type a command and have containers deployed into production in under two hours. You can do that. I don't recommend doing that. We have our own reasons for doing that. One of them is that we deploy on Solaris. If you don't deploy on Solaris, don't do this.
The deploy part is actually easy. Deploy is commoditized at this point. The build aspect is different: not the secret sauce, but the production conventions that your organization will use, how you do logging, how you do all these things, and how to get them into your application. That's not commoditized, and it probably never will be, because the conventions that you have are probably going to be shared by such a small number of other organizations that you can't get an ecosystem or a community around them. Who knows, maybe Rails 7 will just have all of those things in it. That would be wonderful.
So what can you do now? What can you learn from my pain, and our pain, actually, it's not just mine, and take back to your organizations, whether they don't have any services or long for the days of a monolith when everything was in the same place? Well, if you don't have any
services yet, the number one thing I recommend is to get rid of your ActiveRecord callbacks, for all the reasons I mentioned earlier. They make it harder to pull data out of your monolith, they hide code from you, they create semantic distinctions between bits of persistence where there aren't any, and they're just kind of bad. That said, you will have to roll your own transaction management, which isn't hard: ActiveRecord::Base.transaction do, and you've got a transaction. And it will behoove you to know about transactions, how to wield them, and when it's okay to loosen your consistency requirements to get more performance.
If you're just starting your first service, don't start with what we did with auth and have to pull models out of your monolith. As I mentioned, there are two hard problems you'll come across: shared models are really hard, and figuring out conventions for production isn't gonna be hard so much as complicated. Those are both hard problems, and I recommend only having to do one of them at a time. To do that, you have to start with something that adds functionality as opposed to extracting it. That said, pick something that people will notice. This is a political project, again, so if you want to continue doing it, you should probably pick something that people will notice and that you will be able to get momentum behind at your organization. Use boring technology. Be like Drake: don't recreate synchronous protocols with asynchronous ones. Use things that you can get people to support, that you can actually Google for answers on the internet.
Once you're past your first service and you maybe want to start extracting some data out of your monolith, don't try to share that data, and don't try to synchronize that data. Broadcast it. Pick a service that makes sense to be the origin service for that model, and everything else is downstream and
only gets to create new local records of that model when the origin service says they can, basically. And you should operate what you deploy. Developers should be responsible for the production behavior of code that they put into production. That's going to be a lot easier if you have conventions for production. Figure out what makes sense for your organization in terms of all the tools you need to respond to production incidents, and then write a build tool that enforces them, so that you can't forget or can't mess up configuring these things.
You've been a great audience. I really hope you've gotten something out of me basically going, you see the scar, you wanna hear how I got it, for as long as I have. Can we do questions, or are we just not doing that? Okay, we apparently can do questions. But if you have any other questions for me, or comments or observations, I will be happy to take them on the hallway track. I'm unfortunately not a giant, but I do have this ponytail, so if that helps you locate me, then, I don't know, you're welcome, maybe. Anyway, thank you very much.