My name is Steve, I work at Orbitz, hometown Chicago. Today I'm going to talk about a microservices experiment that we did recently. For those of you who don't know us, we're a travel website; if you used us to get here, thanks a lot. I'll start with a quick, very brief architecture overview and history. We'll talk about how we transitioned from monolithic to services to microservices with Docker. And of course, you can't talk DevOps without talking about automated pipelines. Hopefully we'll have questions at the end.

All right, so I'll start with the splash page and go back to the year 2000. Orbitz was started by the major airlines. The goal was to create a single place to shop for travel, and, ha ha, yes, we're still hiring. Okay, obligatory plug out of the way.

So the architecture looked something like this, standard for 2000: you had your web layer, which talked to your business layer, which talked to the airlines. The website was launched in 2000 with what I think we still call the Warbot. We had one application, and we did releases as needed. What we basically did, alongside the likes of Expedia and Travelocity, was bring travel to the customer: it became self-service. You didn't have to call this nice lady on the phone anymore when you wanted to go somewhere and have her type cryptic commands into systems that were invented before most of us were born. And those boxes of paper tickets on the floor? Gone.

The thing she was typing into is called a Global Distribution System, or GDS. So when we're talking about this picture, to communicate with the airlines, you talk to a GDS. The first one we hooked up to was a GDS called Worldspan; they're still around. By the time 2003 came around, we hooked up to our next GDS, Sabre. And this was the financial incentive for creating Orbitz: if we booked, say, an American Airlines ticket and we booked it in Sabre, it was actually cheaper for the airline. That's how the economics worked. Around the same time, American decided, hey, we've got to get a website, everybody else has a website, so we were happy to provide a business layer for them. Northwest did much the same thing.

Pretty soon we saw this kind of architecture wasn't going to scale. So in the 2004 timeframe we switched to a services model. The idea was that we'd have these nice, small components that would have different release cycles, could evolve independently, and could be operated by different teams. This worked fine for a while. We hooked up to other back ends. American eventually decided to connect to Sabre themselves. Northwest got bought by another carrier. Orbitz acquired other brands. American Express needed a booking engine, so we provided that for them. Anyway, there's a lot more history here; it's a 15-year-old company and I'm not going to show you everything, but you get the idea: your site is never really done. Everything you have today is always going to be evolving. This idea that one day everything will be using this one technology and it'll be blissful nirvana? Never going to happen. So fast-forward to today: we've got multiple brands and web services.
For instance, if you search on Kayak, they actually go through our web services. There are websites like Orbitz and Orbitz for Business and so on, and communication with lots of back ends. The platform is now composed of over 500 of these services, running on thousands of instances, and we do deployments more or less daily.

Now, as you can imagine, applications that evolve over a decade tend to accrue process. This is a picture that Jacob (he's out here somewhere, hey Jacob) took of somebody trying to stitch together the path from code to production and all the steps it took. If you need the panorama mode on your camera to capture your release process, you are not DevOps. Things have improved a lot. In 2010 I think we were doing maybe four releases a year; you can imagine how well those went. By 2012, which was around the time of this picture, I think we were about 18 days from code to production. Today, I'm happy to say, it's closer to one to four days. But in reality it's not going to get much better than that using what we're doing today, because the bottleneck is no longer the technology; it's the people and the process.

And you start to see Conway's Law at play here. If you're not familiar with it, Conway's Law basically says that your software is organized around the way your company is organized. This is kind of the roots of DevOps: dev has their tool for deploying code, and operations has their tool. We switched to Chef maybe three years ago from CFEngine, thank God. But what happened was that the pain and complexity of trying to keep those two systems properly aligned made people stop creating new services. So what wound up happening was they just kept adding more and more to the services we had, and 15 years later, instead of one big giant application, we had hundreds of giant applications, and they were all complicated.

And so one day somebody came along and said: I have one of these giant applications and I want to break it apart. I want to decompose it into its 40-plus little sub-services. I want to follow 12-factor principles, I want simple configuration, and I want to package it all up in Docker. And really the goal was: I want to deploy this in minutes, not days.

So this (sorry about that) is an Orbitz landing page. If you go to Google and type "Chicago hotels", you're going to see something like this. The page is basically composed of what are called modules. The modules come from various data sources, like databases or Solr or other services that we have. And it looks something like this: the web layer talks to a content orchestration service, which calls all the various modules, composes a page together, and returns it. Here we're showing three, but it's really about 40 modules, managed by about 10 teams. And this is all one Java app, so you can imagine there's a lot of dependency hell going on, and a lack of resiliency: a bug in any one module basically nuked the whole service. You also had a really tight release schedule; you could only release on a certain day, so you'd have lots of changes from lots of teams, and it was really hard to know, when it broke, who broke it.
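To picture that orchestration pattern, here's a rough, minimal sketch in Python. The module names and URLs are hypothetical, not Orbitz's actual endpoints, and it includes the per-module timeouts and fallbacks that the monolithic version lacked, which is exactly the isolation the decomposition was after:

```python
# Hypothetical sketch of a content-orchestration service composing a page
# from independent module services. Names and URLs are illustrative only.
import concurrent.futures
import urllib.request

MODULE_URLS = {
    "hotel_search": "http://hotel-search.internal/render",
    "deals": "http://deals.internal/render",
    "reviews": "http://reviews.internal/render",
}

def fetch_module(name, url, timeout=0.5):
    """Fetch one module's HTML fragment; fall back to empty on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return name, resp.read().decode("utf-8")
    except Exception:
        # A broken or slow module degrades only its own slice of the page,
        # instead of nuking the whole response.
        return name, ""

def compose_page():
    # Call all modules in parallel, then stitch the fragments together.
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(MODULE_URLS)) as pool:
        futures = [pool.submit(fetch_module, n, u) for n, u in MODULE_URLS.items()]
        fragments = dict(f.result() for f in futures)
    return "\n".join(fragments[name] for name in MODULE_URLS)
```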
And so the idea was: let's take these modules and pull them out. We'll make it all run in Docker and, like I said, follow 12-factor principles. What this allowed was more team ownership: every team was responsible for making the changes, doing the configuration, doing the deployments. Really, what we wanted to do was trivialize the concept of a release. Think about how we did it before, which might be how you do it too: every time you check code into your source repository, a Jenkins job kicks off a snapshot build or a beta build, runs some unit tests to make sure everything still looks okay, and then you basically throw it away, over and over again. Then Tuesday comes around and it's time for a release, and now you actually deploy it. That fear of deploying all the intermediate builds was just piling up changes, and potential backlash, for the day you actually deployed to production. In this DevOps, continuous integration, continuous deployment world, every build is a release candidate.

So the first thing we did was say, okay, we're going to be deploying Docker apps. You could do a whole talk on this (we've actually done a whole talk on this), but basically this is the node where we want to deploy our Docker apps, those little green apps. It's built from the bottom up: we use Chef to provision the box, and it installs and configures Docker along with some companion services on the side. For example, you can't really log to Docker, so we use things like Logstash to ship logs off somewhere. That black box is Consul, an eventually consistent service-discovery system that we use for discovery between the orchestration service and the Docker containers. And over on the left, if you're not familiar, the one at the top is Marathon and the M thing is Mesos. We use the combination of Marathon and Mesos to deploy Docker containers across a farm of machines. But really the deployment step looks like this: some step in your Jenkins pipeline goes to Marathon and says, hey, launch this or upgrade this, and it just does it.

Which leads us to this continuous delivery business. If we were going to have minimal touch points for people, we wanted the idea that when you commit your code to the repository, it's going to production. And so you need some kind of gate. The security folks don't like it when you say any developer can just make a change and push it to production, because that developer might be evil or something, so you need some kind of oversight, and you might want to run automated tests before anything actually ships. So we adopted a pull request model. The idea is that when you open your pull request, the little merge button is grayed out. Somebody has to review it; you have some minimum number of approvers, and they can't merge it either, they can only approve it. Then a little robot comes along and does the merge, and from that point on you've commenced the pipeline to production. It's human-free after this point.
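A minimal sketch of that merge robot, assuming a Stash (Bitbucket Server) style REST API; the host, project, repo, and approval threshold are made up, and authentication is omitted:

```python
# Sketch of the "robot merge" gate: merge a pull request only once it has
# enough human approvals. The endpoint shapes follow the Bitbucket Server
# (formerly Stash) REST API; host, project, repo, and threshold are made up.
import json
import urllib.request

BASE = "http://stash.internal/rest/api/1.0/projects/TRAVEL/repos/landing-page/pull-requests"
MIN_APPROVALS = 2  # assumed policy, not Orbitz's actual number

def get_json(url):
    with urllib.request.urlopen(url) as resp:  # auth headers omitted
        return json.load(resp)

def try_merge(pr_id):
    pr = get_json(f"{BASE}/{pr_id}")
    approvals = sum(1 for r in pr.get("reviewers", []) if r.get("approved"))
    if approvals < MIN_APPROVALS:
        return False  # the merge button stays grayed out
    # The robot account (not the reviewers) performs the merge,
    # which is what kicks off the pipeline to production.
    req = urllib.request.Request(
        f"{BASE}/{pr_id}/merge?version={pr['version']}", method="POST")
    urllib.request.urlopen(req)
    return True
```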
So the Jenkins pipeline is triggered from source control. We use Atlassian Stash as our internal Git repository, and a Git hook triggers the pipeline: build and package the Docker app, then various deployment steps, plus some paperwork steps, because every change to production needs a change ticket.

Now, this is a simplified pipeline. We really have about 20 environments, and when we deploy to production we actually deploy to about six environments in parallel, but to keep it simple it looks like this: you do your build, you run your unit tests, you create your Docker artifact and push it to your Docker repository. We deploy it to a dev environment where we run some acceptance tests, do some more deployments, eventually open the paperwork, deploy to production, and then close the ticket as successful or not.

The build step looks like this (there's a sketch of it below, after a quick sidebar on our Jenkins slaves). The code itself contains a version property file, and you version it however makes sense to you, major.minor, so in this case 1.2. Now, because every build is a potential release candidate, you really need to tag and version everything; the idea of a beta is just gone, everything's a release. So we take the Jenkins build number for that job, stick it on the end, and package the thing up in a Docker container. Before, we would have just pushed this to our Maven repository and had some other tool, run by somebody else, do the deployment when it was time. Now we package it up in a Docker container. This is a simple Spring Boot app, which is why it's just java -jar. We package it up and push it internally to Artifactory, which is our Maven repository (it does Docker too), but for most people this would be Docker Hub or a private Docker registry. The rest of the pipeline from this point on just pulls down this Docker image. Everybody with me so far?

Okay, now the one other change we made along the way, a little sidestep. In a traditional Jenkins setup you have a Jenkins master and a bunch of static slaves, and the slaves have to have everything on them that you might possibly need. We took a tip from an eBay blog post we found a while ago, where they were using Mesos, and they did something like this: when a Jenkins job gets triggered, or decides by polling that it needs to do a build, instead of using a static pool of slaves, it spins up an ephemeral slave, which connects back to the master, performs the build, and publishes the artifacts. And when you're done, you just nuke the slave. (Do you know how long I've wanted to use that stupid flame graphic? How many years ago did that come out?) What this lets you do is create smaller, single-purpose Jenkins slaves: one for building this, one for deploying that. So you have these little micro deployment pieces as part of your Jenkins infrastructure. There's a whole talk on this; I gave one last week at a Mesos conference (the video's not up yet) where I go through how to set all this up. It's pretty interesting, and if you want to at least test out things like Docker and Mesos, it's a nice, safe, self-contained way to do it that doesn't affect your actual application too much.
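Going back to the build step, here's a minimal sketch of the tag-and-push logic, assuming a version.properties file in the repo and the standard docker CLI; the registry host and image name are made up:

```python
# Sketch of the build step: combine the in-repo version with the Jenkins
# build number so every build is a uniquely tagged release candidate.
# Registry host, image name, and file layout are illustrative.
import os
import subprocess

def read_version(path="version.properties"):
    # e.g. the file contains a line like: version=1.2
    with open(path) as f:
        for line in f:
            if line.startswith("version="):
                return line.split("=", 1)[1].strip()
    raise ValueError("no version property found")

def build_and_push():
    version = read_version()            # "1.2"
    build = os.environ["BUILD_NUMBER"]  # set by Jenkins, e.g. "16"
    tag = f"registry.internal/landing-page:{version}.{build}"  # -> 1.2.16
    subprocess.run(["docker", "build", "-t", tag, "."], check=True)
    subprocess.run(["docker", "push", tag], check=True)
    return tag
```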
All right, so now that the build is done and we've got our Docker container, it comes to the deploy step. Traditionally, when you're dealing with static deploys, you're moving an artifact from environment to environment, and you need something to do the launch and parameterize it, usually by passing environment variables or whatever. In this particular case, we're doing that in combination with Marathon, where I just say "launch this, and I want three," and Marathon, in conjunction with Mesos, figures out where to put them. You can think of Marathon as an init.d for a bunch of boxes; it's very similar in a lot of ways to things like Docker Swarm.

So the deployment looks something like this. I pull down my playbook from source control. In this case, say I'm already running version 1.2.16 of my application and I want to upgrade to 1.2.17. I make a call to Marathon saying I want this new version of the code. The call to Marathon is actually asynchronous; it just returns a deployment ID. But because this is a Jenkins pipeline, we want it to block, because you don't want it to go all the way to production before you actually see whether the deployment was successful. So we added a little bit of Ansible logic to make it look synchronous. The idea is: you check with Marathon whether your application is already running; if it's not, you do a POST, and if it is, you do a PUT. Then you take the deployment ID you get back and just poll for a while, waiting for the deployment to finish. At that point, Marathon interacts with Mesos, figures out where to deploy everything, and deploys the new versions. When you deploy an application with Marathon, you can also give it a configurable health endpoint that it checks to see whether you actually came up properly. Did you connect to your data sources? Are you okay to continue? If so, it shuts down the old stuff; otherwise it aborts and sticks with the old version.

So this works great. And this is basically one environment, dev; we do this again for staging and so on. But what's really nice about this setup is that with Marathon providing those init.d-like capabilities, if something dies, Marathon handles it. Everybody says, oh, my server's going to die. Servers don't actually die that often. What actually happens is that a lot of these, in our environment, are VMs, and sometimes a VM just dies, or somebody needs to service the box and shuts it down, so your hardware just comes and goes. When that happens, Marathon figures it out and goes and finds a new home for the missing capacity, because it knows it's supposed to have three. So it's a layer above what you typically think about when you're dealing with Chef and Puppet and an individual server. And everybody kind of has to build something like this.
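We did ours with a bit of Ansible, but here's the same synchronous-looking deploy logic as a rough Python sketch. The /v2/apps and /v2/deployments endpoints are Marathon's actual REST API; the Marathon host, app id, image, and health-check path are illustrative:

```python
# Sketch of the Marathon deploy step: POST if the app is new, PUT to
# upgrade, then poll the asynchronous deployment until it finishes.
import json
import time
import urllib.error
import urllib.request

MARATHON = "http://marathon.internal:8080"  # assumed host

def marathon(method, path, body=None):
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(MARATHON + path, data=data, method=method,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def deploy(app_id, image, instances=3):
    app = {
        "id": app_id,
        "instances": instances,
        "container": {"type": "DOCKER", "docker": {"image": image}},
        # Marathon only retires the old version if this check passes;
        # otherwise the deployment is aborted and the old version stays.
        "healthChecks": [{"protocol": "HTTP", "path": "/health"}],
    }
    try:
        urllib.request.urlopen(f"{MARATHON}/v2/apps/{app_id}")  # already running?
        result = marathon("PUT", f"/v2/apps/{app_id}", app)     # upgrade
    except urllib.error.HTTPError:
        result = marathon("POST", "/v2/apps", app)              # first launch
    # Block the Jenkins stage until Marathon reports the deployment done.
    dep_id = result.get("deploymentId") or result.get("deployments", [{}])[0].get("id")
    while any(d["id"] == dep_id for d in marathon("GET", "/v2/deployments")):
        time.sleep(5)

# Usage: deploy("/landing-page", "registry.internal/landing-page:1.2.17")
```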
Although there's a lot of work going on around this, which I'll talk about in a second. All right, so then there are other steps. We have other Jenkins steps, with very special-purpose slaves, that do things like tests. In this particular slide I'm showing a Jenkins job that probes our service discovery to figure out where all those instances actually got deployed, so I can do something like check whether leader election among the three instances was successful, or something like that (there's a sketch of this below). The idea is to push as much as you can into the unit tests at the build step, but once you start deploying, start testing the actual running app.

Other steps are things like the paperwork we talked about. If you work in a big company, you probably have a requirement like this: every change requires a change ticket. Which usually means a trip to the CAB, the change advisory board, and it's the same day every morning, and you've got to get up and go say, please, please, can I deploy my stuff to production? That's not DevOps. And it turns out that's usually something that was put in place a long time ago, and you don't even know why you're doing it. If you actually go ask, "why do I need this ticket?", they say, "well, we need this to be recorded." And it turns out that unless you're talking about your financial apps or things like that, they're okay with you not asking permission to deploy your code, because it's relatively low risk. So you can flip this around: instead of your change ticket being a request for permission from humans, it becomes an automated record that this thing happened. What we do now is, if we make it as far as the production deployment, we look through the commits, pull out all the JIRA issue keys, and create a nice rich ticket that says: this is everything that went into this deployment (also sketched below). If a person filed this ticket, it would just say "deploy version 2 of X to production, please," and then you'd have to go dig around for all that information. So when people are scrambling to figure out what changed and what broke, you now have much more information. At the end of the deployment, if it failed, we go update that ticket, mark it as failed, and close it, and if you want to do a new deployment, you basically start all over. If everything was fine, you close the ticket as successful. Everybody's happy: the paperwork people got what they needed, and you got fast deployment.

And so, sorry about the eye chart here, but this is meant to demonstrate that the people are at the beginning, and everything after that is automated. That's what you want to go for. With 20 environments, this is now about a 10-minute cycle for us, so I'd say this worked out pretty well.
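That service-discovery probe looks roughly like this. The /v1/health/service endpoint is Consul's real API; the service name, the /status endpoint, and the leader-election check are hypothetical:

```python
# Sketch of a post-deploy test: ask Consul where the new instances landed,
# then hit each one to verify the running app (here, a hypothetical
# leader-election check across the instances).
import json
import urllib.request

CONSUL = "http://consul.internal:8500"  # assumed host

def discover(service):
    # Only instances whose Consul health checks are passing.
    url = f"{CONSUL}/v1/health/service/{service}?passing"
    with urllib.request.urlopen(url) as resp:
        entries = json.load(resp)
    # Service.Address can be empty; fall back to the node's address.
    return [(e["Service"]["Address"] or e["Node"]["Address"],
             e["Service"]["Port"]) for e in entries]

def check_leader_election(service="landing-page"):
    leaders = 0
    for host, port in discover(service):
        with urllib.request.urlopen(f"http://{host}:{port}/status") as resp:
            if json.load(resp).get("leader"):
                leaders += 1
    assert leaders == 1, f"expected exactly one leader, found {leaders}"
```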
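And the change-ticket automation is essentially this: scrape the JIRA issue keys out of the commits that are about to ship, file a record, and later close it with the outcome. The git log parsing is standard; create_ticket is a made-up placeholder for whatever change-management system you have:

```python
# Sketch of turning the change ticket from a request for permission into
# an automated record of what shipped.
import re
import subprocess

def jira_keys(old_rev, new_rev):
    # Collect commit subjects between the running and the new version,
    # then pull out anything shaped like a JIRA issue key (e.g. TRAV-123).
    log = subprocess.run(
        ["git", "log", "--format=%s", f"{old_rev}..{new_rev}"],
        capture_output=True, text=True, check=True).stdout
    return sorted(set(re.findall(r"\b[A-Z][A-Z0-9]+-\d+\b", log)))

def open_change_ticket(app, version, old_rev, new_rev):
    issues = jira_keys(old_rev, new_rev)
    summary = f"Deploy {app} {version} to production"
    description = "Automated deployment record. Includes: " + ", ".join(issues)
    return create_ticket(summary, description)

def create_ticket(summary, description):
    # Stub: replace with the API of your actual CAB / ticketing tool.
    print(f"[ticket] {summary}: {description}")
    return "CHG-0000"  # placeholder id
```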
Okay, now, everybody always asks: why didn't you use this, that, or the other thing? There's always a big bag of technology that you have to assemble into your platform. Nothing I showed you here was necessarily the first thing we tried, or necessarily something we're totally happy with, and everything's replaceable. But there are some technologies, especially around Docker deployments across large fleets, that look interesting and are still emerging.

The first is Kubernetes. If you're not familiar with it, it's a project open-sourced by Google, based on how they internally manage pools of containers running across lots of machines. If you're on Google Cloud, definitely check out the integration; integrations with the other cloud providers and with private deployments are still coming. If you're on Amazon, you definitely want to look at the Elastic Container Service; it's actually very similar in a lot of ways to the Mesos stuff, although they swear they didn't borrow any of it. And then of course Docker, the company, is doing a lot of great work around Docker Swarm. There's some Docker Swarm-to-Mesos integration, for instance, and we're keeping an eye on this, because Swarm may become the standard API for working with pools of machines, replacing something like talking directly to Marathon for us. Although, as I understand it, right now it doesn't do supervision: if something dies, it won't restart it.

Now, the last two things at the bottom, HashiCorp Vault and Rancher's Convoy. I mention these because somebody always asks me at the end of the talk: are these stateful applications or stateless applications? Right now they're stateless; you saw before that we connect to external data sources. And I think the two big gaps in the Docker space right now are these. First, there's no out-of-band way to pass secrets, like that database password, to your container. A lot of people use tricks like Chef encrypted data bags: throw the secret on the box and mount it through a volume. That works, and we do it a little bit, but it's not great, and that's the problem Vault is aimed at. Second, at some point we're going to move stateful services in. Rancher just open-sourced Convoy, which uses the new pluggable volume-driver support in Docker 1.8 so you can do volume management in Docker. I'm looking forward to seeing what happens with that.

Okay, so in conclusion: whatever your deployment scheme is, however you assemble your bag of technology, you want to take a step back and start with the pipeline. How do you get from checking code in to production? Eliminate as many steps as you can, and definitely eliminate all the people steps. The stuff John was talking about before, about immutable deployments: we really love this idea, because if I built something one way in staging but a different way in production with Chef, how do I know I'm actually running the same thing? It could be configured completely differently. We've had situations like that, where something broke in prod but worked fine in staging because of a configuration difference. So the idea is: use Docker to create repeatable apps, and use something like Chef to create repeatable infrastructure, because as great as Docker is, it has to run on something that's been configured (in our case Mesos and the like, along with all those companion services it needs). And then use something like Jenkins, or pick your favorite pipeline tool (we were already using Jenkins, which is why we went with it), so that your process is repeatable and as hands-off as possible. And then of course there's the evil configuration that always causes the problems.
Well, what we've tried to get people to do is this: if you know it at build time, put it in the Docker container. I was always taught as a developer, oh, your thread pool size, put that in a property file. And pretty soon your property file has 80 bazillion things in it and becomes unmanageable. I'm not going to change that from release to release, so bake it right into your Docker container, or your Amazon image, or however you're doing your image-based deployments. All the things that are actually different from environment to environment, like "connect to this database in staging, connect to that database in production," pass those in as parameters at launch time; if you need to change one, you relaunch. For the first category, if you do need to change that thread pool size or whatever, you make a new Docker image. It's easy. And then for things that change outside of any environment or compile-time concern, things like the current exchange rate, move those to external services: either data sources or something like Consul or etcd or ZooKeeper. Pick your favorite; there are lots of them. Because you don't want to put this off too long. You're going to start with one app (we started with one app) and pretty soon, if you're successful, you're going to have hundreds, and you don't want your process beating you over the head as time goes on. You want to tackle this early, so don't put it off.

And with that, that's all I've got. I think I have, what, about three minutes for questions? Yeah, and we'll pass the mic around. Be kind.

[Audience] Hello. Wow, that's loud. When you have hundreds of endpoints and hundreds of services behind them, and then of course you have to multiply for redundancy and multiple data centers, you end up with thousands and thousands of nodes. How do you avoid being beaten to death by monitoring and automation on thousands of nodes? Because I'm hitting the same thing: we have hundreds of endpoints, and the developers have become addicted; every new service has its own endpoint and cluster.

Yeah, okay, that's a good question. Most of what I talked about here is around the deployment stuff, but obviously the monitoring is very, very important. There's some of it we use during the deployment cycle, like the health endpoints, to see whether something is successful and healthy. And in this particular setup you have Marathon watching everything, so for the most part it will restart things; most of the catastrophic failures get dealt with automatically. But most of the devils are in the ones that are still okay but really, really slow, and yes, you do need to watch for those. In our particular world, all the applications emit metrics to a central location, which is basically Graphite. We're in the process of revamping that to run through a sort of Kafka/Storm kind of thing, so we can do more processing on it in real time and react to it faster. But for the most part it's some combination of setting up a place for all those metrics to go, and then creating different views. There's the "is the site broken" view at the very highest level, and then you want the teams watching their own stuff.
Because at some point you have more things than eyeballs can look at, and you don't really want eyeballs on it anyway. Eventually you want to be able to query your time-series database, or whatever it is, and say: when this goes above this, alert somebody, or do something, or restart something. That's really the goal. But you have to be able to spread the pain, because as a developer, if your stuff is broken all the time in production, you probably just don't know it. Breaking the site, that's what you hear about. You don't hear about your application throwing errors every three seconds when it's not impairing the site. As a developer, you could still go fix that. It doesn't require all hands, but it does require that you create that feedback loop. There's tons of good stuff in there for an open space. Other questions? Well, I'll be around if you want to ask me later. We've got one more question. Yeah, sorry.

[Audience] Hi, my question is more about the process of going from what you had before to this. How long did that take? How many team members? What was the collaboration like?

So for this particular experiment, as you saw, we extracted the stuff behind the existing services, so the entire rest of the system stayed exactly the same; nobody really even knew we were doing it. And there was that moment where somebody said, you're not doing it like all the other stuff; you're creating this potential Pandora's box where now we're going to have two ways of doing things. And that did actually happen. The actual collaboration between development and operations was maybe two people in operations and maybe six developers. We just went back and forth, decided we'd pick this, that, and the other thing, tried it out, and kept iterating until we found something that worked. But at some point, when it became "this is in production, this is a real thing," the developers actually went and had all those conversations with security and change management to make sure that any and all of their concerns were dealt with for this particular application. And so that approval chain became a one-time thing. When you create a new application, they want to know: what's the impact if something goes wrong? Is it going to be a huge financial impact? What will the website look like? Once you've addressed all of that, they're okay with you doing these continuous deployments, because they know what you're doing and they know what the scope is. Did that answer your question?

[Moderator] Yeah, there's a lot to unpack in seeding and catalyzing organizational and cultural change, so that can be an open space too. Let's thank Steve.

Thanks, guys.