I have a talk coming up, we're running a little bit behind, so I'm not going to introduce this man for too long, except to say that he is the greatest Canadian who ever lived. He was actually born a Mountie in Canada. That's not a thing that happens very often, being born with your horse and your sword. His name is Adrian, he's from Seattle. He works for a company called WellTalk. Let's see if he talks well. Round of applause please. Hi, I'm sorry about that introduction. I'm actually from Seattle, not Canada, but real close. That's pretty good. Oh yeah, go Canada. So my talk is about deployment Nirvana, and it's about some of the issues, some of the challenges, and some of the stuff that we learned at a company called Moz, where I built a lot of deployment infrastructure for our SOA app. Conveniently enough, everybody this afternoon has been talking about services and why they are amazing and terrible at the same time, so it's been a really great lead-in. So first of all, the web, as we build it, is definitely changing, right? We're moving more and more away from these big, giant monolithic apps, and we're starting to split them up into services. This actually reflects back on some of the stuff that we used to do in the 90s, a lot of the ways that we would factor out large enterprise apps. But now we're breaking them into these little small apps that communicate over HTTP or NSQ or any sort of message broker, a bunch of different ways. And it's actually not about performance scaling, right? A lot of times we'll talk about like, oh shit, we've got to break out this monolith because it doesn't scale or it's too slow or any of these reasons. Well, it turns out that stuff like this [servers] is really, really, really cheap compared to stuff like this [people].
People are really expensive, and it turns out there's a bunch of cognitive load, and you can't just throw more people at a problem. I mean, we all know that we can get a baby in one month if we just throw nine women at the problem, right? It's the nature of the system that we work in that as teams grow, your productivity doesn't necessarily grow alongside. So again, with our movement to SOA, microservices, whatever the current term we want to use is, what's happening is that what used to be a very complicated app, a very complicated piece of software that was nice and easy to deploy onto your Commodore 64, is now starting to become all of these individual little apps, and we have to deploy them separately. We have to coordinate them separately. We have to worry about how they intercommunicate. We've got all of this extra stuff, this extra cognitive load that we have to understand. And so the problem that I attacked at Moz, that we're going to be talking a little bit about today and that I'm going to be tackling for a while in the future, is how do we do all of this? How do we deploy 40 services, and how do we do it in a way that makes life easy for developers? As an engineer, I don't want to think about what happens when John bumps a service and blows up the world, because why should I have to think about that? We've got versioning, we've got tools, we've got ways that we can solve this problem. So hi, I'm Adrian. That's not it, hold on. That's closer, but also there's my MySpace reference, Jonan, you're welcome. Hi, I'm Adrian. So let's see here. I was at Stride for a while, which is a startup that I had with a few guys. Then I went to Moz, and that's actually where I learned a lot of the stuff that I'm going to be talking about now. Now I'm at a company called WellTalk. Basically I am writing code and leading teams and doing all of that stuff that we do and get paid for, and we have a pretty darn good time doing it, actually.
So we're going to rewind a little bit to when I was at Moz, about a year and a half, two years ago. We wanted to build some software. It was this glorious, wonderful, greenfield app, right? So we all sat down, and we actually had some pretty smart folks on the team. We had a couple of ex-Google guys. We had a couple of extremely brilliant, fresh-out-of-college folks. I was the anchor just kind of dragging everybody down, but I could help people laugh, so that was good. So we looked at the problem and we built this tool called Moz Local. Moz Local was, and hopefully still is (I hope it still runs), primarily about transforming API data, right? The core of the problem and the core of the app was: let's say that I'm a small business owner and I want to make sure that my small business listing is up to date and correct across, say, Yelp, Superpages, Foursquare (I guess it's now Swarm), Facebook, all of that stuff. So we had to work with all of these APIs, and it turns out that we were streaming a lot of the APIs and we had a lot of concurrent connections running. So Node.js actually looked like a pretty convenient, good way for us to build it. Because we were doing a ton of parallel IO and there wasn't really a ton of business logic involved (it was mostly just taking these external APIs, manipulating them, normalizing them), we decided to do it in Node. And because of that, also, we decided to do it as a large pile of independent services. Part of this was just because it made sense for the specific app, and part of it was team scaling. When we thought about why we were going to do a service-oriented app or a microservice app versus not, we thought a lot about team scaling versus performance scaling. By breaking it out into separate services, we were able to minimize the cognitive load for any one person. I didn't have to understand the entire app. I didn't have to understand how our caching layer worked.
All I had to know was that as a consumer, as a user of it, I would just interact with it through the published, well-defined API contracts that we would give internally. It's very similar to having internal APIs where things are well defined: what's public, what's private, all of this stuff. We don't really do that very well anymore; a lot of times we just hack and slash out some code. So it was a lot about minimizing the cognitive load, and it was also about composability. I touched on this a little bit: we had a caching layer, and we had a lot of interactions where we would work with an API and think of it as a stream, so we could compose together a couple of different services. So all this stuff was looking pretty good for us. Well, then we thought about why not service-oriented architecture. There were some complex interdependencies, in that services relied on each other. We had a service that encapsulated our data persistence layer, and underneath it there was Cassandra, there was Postgres, there was a little MySQL, there was some Redis. We don't really want to talk about it. Which was great: as a developer, as an engineer, I didn't have to think about any of that stuff. It was just taken care of for me by this service. But it's very complicated. We had some reliability issues, which I'm going to touch on in a second. Debugging it was nigh impossible: I see an exception, and I have to unwind the entire service stack of how this request bounced through 15 services. That was really painful. The toolchains were nonexistent. Foreman is a tool now that is very good and is helping us with some of this stuff, but it's still hard. It's still hard to coordinate 15-odd different services, or 20, or five, or however many we're working with. It's hard to do this in a development environment and think about it well.
Integration testing was one that was really painful, and, really subtle foreshadowing, deployment was hard. So we made the call, and we decided to go with service-oriented architecture. So we got to work. This was basically the look on all of our faces as we started to get further and further into the madness. And you can get that photo real quick then, because three, two, one. OK. So first of all, we were using a tool called Circus, which I don't know if anybody here is familiar with. It's basically a process manager that allows us to run N processes or services or what have you underneath it. Don't ask why we were using it instead of init.d or Upstart or runit or any of the other ways to do it, but this was the tool we decided to use. We also had this thing called collateral flapping, which I'll touch on in just a second. Deploys resulted in downtime, and service-level versioning, again, one of those subtle, subtle foreshadowing things. So Circus was our first problem, and I'm going to call it problem zero because it wasn't even really a problem for us: before we actually even went live, before we launched, we just killed it and replaced it. So the lesson from that one was always vet your tools before you deploy, before you go live, because you can actually get yourself in a little bit of trouble if a core part of your infrastructure just straight up doesn't work for you. The next one was the collateral flapping issue. Because we sucked at Node, we didn't know what we were doing, and because domains were super confusing (for us, again, point one, we didn't know what the heck we were doing), basically what would happen is, say we've got five concurrent requests coming in through this Node service and one of them crashes. Well, we weren't doing a really good job of catching those exceptions, so all five of them would just die. A lot of people would get 500s or broken sockets or any of those really painful things. Well, so what we did is we solved it.
We built a service registry and an HTTP proxy, the simplest way, of course, to solve a crashing problem. We built a service registry and an HTTP proxy, and we ran multiple instances of a given backend. We were able to do some kind of cool stuff where, if we haven't sent any bytes back down the wire and we're not that far along on a timeout, if one backend goes down, we just transparently use another one. The end user doesn't even have to know that they killed a backend. They don't know. The second one was we had deployment downtime. Well, we've got a port registry now. So we fire up a new backend, it registers itself with our port registry, and then we basically just cleanly reap the old backends and the HTTP proxy cuts traffic on over. So now we've got this interesting pile of stuff, these three separate home-rolled, loosely coupled tools. We called our proxy DMV, because we all know how efficient the DMV is at routing requests. We had Cport, which was our port registry, which we eventually replaced with a tool called Portland. Portland, of course, was some fair-trade, free-range port registry service. The README was literally nothing but a bunch of hipster jokes. So that was pretty fun. And then Viaduct. Viaduct was our new service registry and our new process manager, because we all know that the Viaduct's never going to go down. The laugh there, for everybody who doesn't know, is that there's a viaduct in Seattle that's been going to go down in the next big earthquake for like 10, 20, 30 years now. So now we're trying to cut a new tunnel underneath it, and that's also gone really badly. We're like $5 billion in now, trying to cut a tunnel that's going to save everybody in the next earthquake, but it'll probably just flood. So that was Viaduct. So it was the greatest thing ever, right? We've got these great, awesome tools. They shield us from ourselves when we write bad software and it crashes.
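The failover trick described here can be sketched in a few lines of Node. This is purely illustrative, under the assumption of a simple in-memory registry of live ports per service; the names and structure are mine, not the actual DMV/Viaduct code:

```javascript
// Illustrative sketch (not the actual Moz tooling): a registry of live
// backend ports per service, and a request helper that transparently
// fails over to the next backend if one dies before any bytes are sent.
const registry = {
  'listing-service': [7001, 7002],
};

async function requestWithFailover(service, makeRequest) {
  const backends = registry[service] || [];
  let lastError = null;
  for (const port of backends) {
    try {
      // Only safe to retry because nothing has gone back down the wire yet.
      return await makeRequest(port);
    } catch (err) {
      lastError = err; // dead backend: fall through to the next one
    }
  }
  throw lastError || new Error('no backends registered for ' + service);
}
```

The key constraint is in the comment: a retry is only transparent to the client while the proxy hasn't written any response bytes yet, which is why the talk mentions the byte-count and timeout checks.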
We've got zero-downtime deploys. We've got all kinds of controls. So we can do some really interesting stuff now that we run the code for our routing layer. We can do some fun stuff around versioning, and we can do some fun stuff around cutting over traffic for different users, all kinds of stuff. Well, it was also the worst thing in the world. So, client tooling: we didn't really have any. Our deploys basically revolved around our intern. He was really fast at typing out a pile of commands that changed a little bit every once in a while, so nobody really understood it. It was bad. We fixed that. So the client tooling was really bad. Development was really, really bad. Again, because we had to run 10 services, and you have to run, say, oh, I don't know, Postgres and Redis and another Redis on a different port, and you've got to make sure that your Cassandra cluster actually has two nodes, otherwise it's not going to reach consensus and things aren't actually going to get stored. A ton of pain. So that was kind of lame. We made it better. It got better. The really interesting one, though, is inter-service versioning: when you've got service A and service B and service C, the versioning between them was kind of unsolved. So service versioning, which was actually touched on in the last talk with Pliny, is pretty much a solved problem, or so we think. We've got all kinds of ways that we can ask for a version on a service. We can do it in the URL, we can do it as a query param, we can do subdomains, we can do what Foursquare does, which is a timestamp, which is actually really cool: you include a timestamp, the point in time of the code that you want this request to run against. And so they can sunset stuff after a few months. It's pretty darn cool. The problem is that as an engineer, I have to start thinking about this, right?
So I have to think about versioning both writing and consuming a service. We're never gonna be able to get rid of both, but we might be able to get rid of one. So whenever I make a commit, I get back a SHA, assuming I'm using Git or any other good version control system; I get back some reference to this code at a point in time, which is really similar to what a version is, if not exactly what a version is. So we thought about this, because we were starting to have issues with all of our different services and deployments and what versions were talking to what. We thought, is there anything that we can do here? Are we able to actually use Git SHAs as our versions? And it turns out we can, partially because of the tooling that we already had, and partially because we built a little bit more stuff. So we've got this HTTP proxy. It's our own custom code. We can do all kinds of stuff with it. Basically what we do is we extend it to route to specific Git SHAs on a service, and we consider that as a version, right? So we include it as an HTTP header or something else. We also have service intercommunication via NSQ, which is our message queue. So again, we would include the SHA, and we would say, hey look, I wanna talk to this service, let's call it service A, at this Git SHA. And we'd keep backends running long enough, and we were measuring traffic to individual backends, so we would know when traffic got cut away from one and we'd just reap that backend. And we can also do things with domains, right? We can say that production goes to this Git SHA, and we can say that staging goes to that Git SHA. So all of a sudden we're getting a lot of power with our versioning. As an engineer, I don't have to think about versioning. I don't have to write versioning into any services. You know what I mean, it's doable.
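The SHA-as-version routing described here boils down to a lookup table in the proxy. A minimal sketch, where the SHAs, ports, header name, and domains are all invented for illustration:

```javascript
// Illustrative sketch of SHA-based routing: each deployed backend is
// keyed by its Git commit SHA, a request header can pick a SHA
// explicitly, and domains can pin an environment to a SHA.
const backendsBySha = new Map([
  ['a1b2c3d', { service: 'service-a', port: 7001 }],
  ['e4f5a6b', { service: 'service-a', port: 7002 }], // newer deploy
]);

// Domain pins: production and staging each resolve to a specific SHA.
const domainPins = {
  'app.example.com': 'a1b2c3d',
  'staging.example.com': 'e4f5a6b',
};

function resolveBackend(headers, host) {
  // An explicit version header wins; otherwise fall back to the domain pin.
  const sha = headers['x-service-version'] || domainPins[host];
  const backend = backendsBySha.get(sha);
  if (!backend) throw new Error('no backend for version ' + sha);
  return backend;
}
```

Because the routing layer also measures traffic per backend, an old SHA's backend can be reaped once its entry stops receiving requests.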
There are lots of great libraries to do it, but we're at this point right now in engineering where we can build amazing tools and amazing abstractions. So any way that we can make our lives and our colleagues' lives easier, we should probably do it. So there are obviously some advantages and some disadvantages here, right? As an engineer, I don't have to really think about or build versioning. That's good, right? I know that for the software I'm writing, every time I make a commit, I have a version given back to me by the tools that I'm already using. Another one is that we can test multiple releases simultaneously, right? Branches, or, you know, commits on branches, have SHAs. So basically, if we have control over the routing layer and where our requests are going when, we can split them off. And so I can fire off a spike, I can get back a SHA, I can throw it up into our hosting layer, and then I can cut some traffic over to it on a custom domain, or I can give out that SHA to other engineers or marketing or anyone inside or outside of the company, and it just rattles through the routing layer and finds its way to my new services. There are some challenges, obviously. Outside of building all of the tooling, there are some restrictions that we had with the Git SHA-based routing or versioning. Our deployed services have to be reflective. What that means is that a service has to know what its SHA is. And a lot of times, when you release code, it doesn't, right? A lot of times we'll do some sort of tarball, we'll do some sort of archive, we'll use Capistrano, we'll use any of a number of different deployment methods, and a lot of them aren't gonna give us any information on what the SHA is, what the snapshot in time is. It's fine, we can tool up around that as part of the deployment process, but, you know, it's still something. Another one is that we have to run multiple instances of a given service, right?
So this doesn't work really well for memory-hungry apps, right? Luckily we had, I think, one Rails app in play, and that used a couple hundred megs of memory. But we had a couple hundred little Node services rattling around, and they shared a lot of memory together, and this is on a given server, right? So we were able to run it because of the system and the tooling that we were using and building. And so the last big problem that we fought, and that we kind of worked our way around to discovering, was inter-service versioning. That's really where things start to get interesting and exciting and really challenging, and nobody's actually really publicly solved this stuff yet. So let's say that we've got service A and we've got service B. Let's say that service B is some sort of database-persisted service, wrapping any number of persistence layers. Say we've got Cassandra and Redis and Postgres in play and we're using each one for different purposes, right? Cassandra's super good at writes, Postgres gives us relational queries, and Redis gives us some flexibility in terms of data structures. So A and B might need to change in lockstep, and this actually bit us a number of times before we came up with our inter-service versioning. If, say, we bump B, and we don't necessarily think about it (because as an engineering team we have kind of stopped really thinking about versioning, pros and cons of course), if they don't change in lockstep, or if there's no communication across teams, which is another one of those things that hearkens back to team growth, I might break things. You know, I might bump version B, and all of a sudden this thing that was an integer last time is now a string that looks like a number, the kind of thing that happens especially when you start playing with JavaScript. Things like that can break the world, and it's not good. So how do we change these in lockstep?
So basically what we did, and again this was all through tooling and services and routing layers, and I hesitate to call it magic but there was some shady stuff going on under the hood, is we built a release version. So we've got A and we've got B, and then we've got A and we've got B prime, and even though nothing's changed in A, our application, which is this collection of all of our services, has changed, and it's an entirely different application from the consumer's perspective. We're ignoring the hosting layer here; let's just pretend that we've got all of the resources that we could possibly want in terms of compute power, assuming we're able to utilize them effectively, like some sort of collection of giant data centers that we can use on an hourly basis and scale up and scale down. I don't know, data centers, Heroku maybe. So in a perfect world we wouldn't have to have separate A's running, we wouldn't have to have separate B's, et cetera, but from the perspective of a consumer of our API, whether it's internal, whether it's external, whatever, these are separate applications. So we tooled up around this stuff, right? At deploy time, we would build what we call the release version, and the release version encapsulates every single service that we're about to deploy and what it depends on, all of its versions, and then we do an internal mapping between all of them, right? So when I ask for service B at V2, that's gonna give me the SHA of B prime. It works really well, actually. It's a lot of wackiness, but yeah, it's pretty cool. We have some downsides, though. We have to figure out some way that our services are able to correctly request versions. They don't necessarily have to think about hosting the right version, but they certainly have to think about it from the consumption perspective. So, I mean, if we roll back to a slide real quick: when A makes a request to B, right?
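A release version like the one described is essentially a manifest mapping every service to the SHA it was deployed at. A minimal sketch of that mapping, with invented release names and SHAs:

```javascript
// Illustrative sketch of the release-version idea: at deploy time, every
// service's Git SHA gets snapshotted into a release manifest, and the
// routing layer resolves (release, service) pairs to concrete SHAs.
const releases = {
  v1: { 'service-a': 'aaa1111', 'service-b': 'bbb1111' },
  // Only B changed (B prime), but the release as a whole is new;
  // A gets re-pinned at the same SHA.
  v2: { 'service-a': 'aaa1111', 'service-b': 'bbb2222' },
};

function resolveSha(release, service) {
  const manifest = releases[release];
  if (!manifest || !manifest[service]) {
    throw new Error('no ' + service + ' in release ' + release);
  }
  return manifest[service];
}
```

So "service B at V2" resolves to the SHA of B prime, while A resolves to the same SHA in both releases, which is exactly why, in a perfect world, the unchanged A wouldn't need to run twice.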
It's got to be aware of what its version is, so it knows whether it's asking for B at V1 or B at V2. So there were two ways that we could solve this, and we did solve it both ways, the first one sucking and the second one being amazing. The first one we did was we literally would just run separate versions, right? At deploy time, when we encapsulated everything, when we bundled it all up, we would say: this is your SHA, this is what you request. The problem is that despite having tons and tons of, say, AWS capacity and tons and tons of different computing resources at our disposal, running five separate services that are all exactly the same is kind of wasteful, and it kind of sucked. The second way we did it, and this is the way that it's still running now, is each service actually carries through its tag, right? So let's say that we have a Node app and it gets a request coming in; that major release version gets carried through every request. And so everything rattles through a routing layer, which gives us a couple of really interesting benefits. We have really good metrics on the routing layer, so we can watch traffic. And because every single request goes through the routing layer, all of a sudden we can debug now, right? Whereas before a request would roll into a black box and might bounce around through five, 10 services, now we can actually replay all of those traces. So this is all getting pretty good for us, right? So again, I guess there are three ways, right? As an engineer, I can think about it, but we don't wanna make ourselves think, and we don't wanna make our colleagues think. As people building internal tools, internal services, people handling operations, we want to make it so that our users don't have to think about it, right?
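The second approach, carrying the release tag through every request, is just header propagation. A minimal sketch, assuming a hypothetical header name (not necessarily the one Moz used):

```javascript
// Illustrative sketch of carrying the release version through every hop:
// middleware copies the incoming release header onto outbound requests,
// so the routing layer can keep resolving every hop against the same
// release. The header name here is an assumption.
const RELEASE_HEADER = 'x-release-version';

function propagateRelease(incomingHeaders, outgoingHeaders) {
  const out = Object.assign({}, outgoingHeaders);
  if (incomingHeaders[RELEASE_HEADER]) {
    out[RELEASE_HEADER] = incomingHeaders[RELEASE_HEADER];
  }
  return out;
}
```

Because every outbound request carries the tag and goes back through the routing layer, the traces mentioned above come almost for free: the routing layer sees the same release tag at every hop of a request's journey.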
You know, I mean, if we were building an app for small business owners and we said, well, step one, you've got to install Ubuntu, and then step two, you can now sell things at your flower shop, that's not really fair to anyone, right? And so when we've got an engineer who wants to just deploy some software, for me to say, well, step one, first of all you've got to fire up your own AWS instance and get Ubuntu running on it, and then you've got to get all these services going... again, not really fair. We can do better. So yeah, let's just throw the whole idea of making engineers think about it out the door, because we've got tooling. Another one, which was the first one that we did, is you can lock at deploy time, right? So when we roll out N services and we build a revision for them, that's what they get tied to. The third one, of course, is to pass a header around, or metadata on a request, or whatever your request actually looks like. So we're starting to get good in terms of our deployment environment, our routing environment, all that stuff, but there's still more that we can do, right? We wanna build up the client tooling, because our development environment was bad. Luckily now there are great tools to do it. I don't know how many people are familiar with Foreman, but basically it uses Heroku's Procfiles to get multiple services up and running, or, in the case of something that might run on Heroku, you might have timers, you might have a bunch of different processes that have to run. Foreman's good. It's not the best, it's not perfect, but it certainly gets us much closer. Another one that's really painful is assets, right? So what you can do is you can 12-factorize things. I don't know how many people are familiar with 12-factor apps, but if you hit up 12factor.net, it talks a lot about making apps and services standalone so they work happily in an operations environment.
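For anyone who hasn't seen one, a Foreman Procfile is just a list of named process types, one per line, and `foreman start` brings the whole pile up at once. A hypothetical Procfile for a setup like the one described (the service names, paths, and ports are invented) might look like:

```
web: node services/gateway.js
listings: node services/listings.js
worker: node services/normalizer.js
redis: redis-server --port 6379
redis-cache: redis-server --port 6380
```

That one file replaces the intern's pile of shell commands for local development, which is exactly the kind of client tooling the talk is arguing for.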
We hearken a little bit to things like Docker, where we're containerizing, and we're working on that, but it's really all about getting that API contract solidified and concrete between my application that's running and the hosting layer. Where's it gonna log? I don't know, I mean, are we gonna log to /var, are we gonna log to /opt? Well, what if we just use standard out and push that out onto the hosting layer? So, you know, as long as your stuff is 12-factorized, life is gonna be not too bad. So, the lessons that we learned from this long and painful adventure: inter-service versioning is kind of unsolved, right? It's one of those things that not everybody's figured out. When we've got, say, two services, what are we gonna do? Am I gonna have to make sure and version, and then collectively bump my version, and talk to all of my colleagues, and make sure that everybody's on the same page when I bump my version? And then I'll have to hassle Ben to stop using my V1 API because we're on V3 now. Like, no, no, we gotta do better than that. Rolling deployments are pretty much required in today's world. There's no excuse anymore for us not to be able to do this. I mean, we can SIGWINCH things, right? We've got beautiful, beautiful tools to do that. And the last one that we really learned, with all the work that I was doing, was that you need to shield your engineering team from complexity. You basically need a Heroku-style interface where I push code and it works. But there can't be too much magic. I mean, remember early, early Heroku, where they would drop a bunch of gems in and it was really unclear exactly how the database worked, and it was a little shady? You wanna get rid of that magic, right? You need to make sure that, as an engineer, when things go sideways and it doesn't work correctly, you're able to debug it.
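The stdout-logging point above is worth making concrete. A minimal 12-factor-style logging sketch (the field names are illustrative): the app never picks a log file; it writes structured lines to stdout and leaves routing and storage to the hosting layer.

```javascript
// Illustrative 12-factor logging sketch: the service formats structured
// log lines and writes them to stdout only. Where the logs end up is the
// hosting layer's problem, not the app's (no /var/log, no /opt).
function formatLogLine(event, fields) {
  return JSON.stringify(Object.assign({ event: event }, fields || {}));
}

function logEvent(event, fields) {
  process.stdout.write(formatLogLine(event, fields) + '\n');
}
```

This is the contract-solidifying move: the app's side of the agreement is "one JSON line per event on stdout", and the hosting layer can ship that anywhere without the app changing.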
You need to know that you're able to basically get it to work when it breaks. So this is kind of my pseudo-closing slide: we need to deploy better. Five years ago, when Heroku landed, all of a sudden our deployment world got really nice. We'd used Capistrano, we'd probably used Vlad, we'd probably used a number of different tools. But finally this thing showed up that abstracted away so much of the complexity and so much of the pain that, I mean, I know that I deployed 15-odd different toy apps in the first couple of months, right? Just like, oh man, I can put this thing on the internet. I don't have to go get some VPS somewhere and deal with all of that pain, all of that nightmare. We're in this kind of amazing time to be an engineer, where our tools are getting so good. The APIs, the frameworks, everything is just amazing. For monoliths, deployment's really good. But for services, deployment is still kind of rough. It's kind of a challenge, and there's a lot of work that we have to do on it. The other one that we learned, of course, was that you need to minimize moving parts. If I roll back to that slide talking about DMV and Viaduct and the chicken coop out back and all that stuff: we had four daemons and we had two persistence layers, and this was just to run the software that actually serves the customer's needs, right? This is madness. So we need to minimize those moving parts. And the last one is you need to make it pluggable, right? One of the other issues that we ran into was that as time went on, this software got old and there was newer and shinier stuff out, right? We first started with Circus as our process manager, and then we used Upstart, and then we actually ran God for like two weeks, maybe three weeks, and then we wrote our own.
There's a lesson right there. But, you know, the ability to have things modular and pluggable is gonna make all of your deployment stack and all of your deployment infrastructure much, much, much easier. So I do have one more thing. I've built this and open sourced it. Basically I took all of the stuff that we learned the hard way writing it at Moz, and, with some of my colleagues who've done some code review and a whole bunch of good friends and good people I've talked with, I've rewritten it and open sourced it. So hopefully it'll work. It's right now running a couple of my side projects. It's basically a full deployment toolchain and hosting layer for microservices. As of right now, it runs on your own hardware. My stretch goal, which with any luck I'm gonna get done this weekend, is to push it out onto Heroku. Because there's no reason not to, right? As an end user, I should just be able to call for a deploy of all of my services, and they're running somewhere on the internet. That's basically the dream, and it shouldn't have to be a dream anymore. You know, it's 2015. We've got amazing, amazing tools. We've got great beer. I see no reason that we can't get some great deploys. So that's it.