Good afternoon, everybody. Thanks for being here late on a Friday afternoon, after everybody's brains are completely full. But we survived the snow, so we can survive anything. I'm here to tell you a story: the journey that Credit Karma has been on over the last couple of years from monolith to microservices, like it says. I'm Mason Jones, a staff engineer heading up our infrastructure services team. We're the team that gets to play with all the fun stuff we've been hearing about this week: Docker, Kubernetes, Linkerd, and so forth, which I'll be talking more about.

A little bit about Credit Karma. Our mission is to make financial progress possible for everybody. We serve our more than 75 million members by helping them understand their credit score, their credit history, ways to improve things, and ways to find better financial opportunities. We're just over 800 employees now, half of them in engineering. But when our story starts two years ago, we were half that size, 400 employees, half in engineering, and we'd already been growing very rapidly. And we were facing a monolith.

I'll take you through these steps during the talk and tell you about the various phases we went through. But first I'm curious: how many of you have a monolith that you're either preparing to get out of or are in the process of getting out of? Yep, it's a really common story these days. How many of you have already done that and you're just here to gloat? Cool, congratulations.

Our reason for moving from the monolith to microservices, and there are lots of possible reasons that you all may have, was really to get teams independent. We had learned, as probably a lot of you have, that engineering growth paired with a monolith leads to great sadness. We wanted teams to be able to be independent, do their thing, move at their own speed, and not deal with the monolith deployment cycle and the dangers that come from making changes to a monolith. We also wanted teams to be able to experiment. It's much harder to just try something out in a monolith, because the side effects are unknown. So that was what we were after.

Once we made the decision that microservices were going to be our future, we had to figure out how to get there. Where do we start? We wanted to begin with some baby steps. We had a few people in the company with Docker experience, mainly in development environments, but not a lot of experience running Docker in production. So we wanted to play with that first and figure out what was going to bite us. Rather than doing the tempting thing, which is to identify the piece of the monolith causing us the most pain, the piece we most wanted to free from the monolith itself, we chose a couple of relatively unimportant services. Don't tell the teams that own them that I said that. We really wanted to play it safe, because we knew we had a lot to learn. We selected those, and then we wanted to use the tools we already had in hand to get started. To make this more fun, we run on bare metal in our own data centers. We use Salt to provision our servers, and we run CentOS, so supervisord kind of comes with the package.
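To jump ahead slightly and make the next step concrete: the deployment mechanism I'm about to describe boiled down to Salt dropping a supervisord program config onto each server that does a docker run. A minimal sketch of what such a config might look like, with the service name, image, and port made up:

```ini
; /etc/supervisord.d/foo-service.conf -- provisioned by Salt (illustrative names and ports)
[program:foo-service]
; Run the container in the foreground so supervisord can watch the process.
command=/usr/bin/docker run --rm --name foo-service -p 8080:8080 registry.internal/foo-service:1.2.3
autostart=true
autorestart=true
stopsignal=TERM
stopwaitsecs=30
```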
So we thought, all right, let's see what we can do with those. We're not gonna try to bite off more than we can chew, and we're not gonna grab an orchestration system and spend six months figuring it out before we make any progress. We really just wanted to start. So this seems kind of weird, I'm sure, but we said: let's set up deployment of a service by using Salt to provision files onto the servers where we want to run it. Those files are supervisord configs telling it to do a docker run. We provision the files, we kick supervisord, and it does a docker run for us. If that container dies, supervisord is gonna restart it; that's its job, and it does it pretty well. Now, provisioning files with Salt and telling supervisord to do a docker run sounds really horrible, and it kind of is, but it got the job done. We were able to get some containers running, get Docker installed, get some experience with that, try to upgrade Docker, fail horribly, and get some experience with that too. We learned the hard way that upgrading Docker is not always as straightforward as you might expect. But this did work for us.

So by the beginning of this year, we'd gotten to about 15 services running across 20 servers in our data center using this mechanism. We built some deployment tools that would interact with Salt, use some weird Ansible stuff as well, get the files out there, kick supervisord, and let teams deploy, update their images, and restart their containers. It was working. We had some pretty good experience running this stuff in production; we'd figured out logging and monitoring and some of those important things, and the teams building services were actually pretty happy with it. They didn't see the ugliness underneath too much. We did, and we felt it. And we could see a lot more services coming down the line, and we knew this was not something our team could keep managing at scale. It's an obviously, painfully inflexible architecture. Everything is hard-coded: which service instances are running where, which servers have which ones. We were human orchestrators, honest to God using a Google spreadsheet to track here are our servers, here are our services, and here are the ports they're on. It works, but it's not great. So how were we gonna get past this?

The other thing that was actually biting us, and causing more heartache to more people, was that teams would develop their service on their laptop, get it running, go to testing, do their integration testing, and it would be working; then they'd deploy to production and it wouldn't work. The plumbing, the moving parts, the configurations were very different across those environments, and that was not making anybody happy. So we wanted a way to give the same experience across all of our environments.

But again, baby steps. What was the next thing we needed to do? We've still got a monolith, obviously, but we've got some services going now, so that's pretty good. The next step, to make things a little more flexible, a little more automatable, a little more forward-looking, was service discovery. We were already running Consul in our infrastructure for other reasons, and it makes a terrific service discovery tool. So we modified our deployment tool such that after getting the service onto the server and getting the config set up, it would register the service in Consul. Now we had something we could query to tell us what's running where.
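For a sense of what that registration looks like, here's a minimal Consul service definition with an HTTP health check; the service name, port, and health endpoint are made up, and the same thing can be done through the Consul agent's HTTP API:

```json
{
  "service": {
    "name": "foo-service",
    "port": 8080,
    "check": {
      "http": "http://localhost:8080/health",
      "interval": "10s",
      "timeout": "2s"
    }
  }
}
```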
That was quite an improvement. And we got some decent health checks out of it, because prior to this, supervisord would alert us if something died and it restarted it, but if the service stopped responding while the process was still there, that was pretty much invisible to us. We would often not know about it until someone reported that traffic just wasn't flowing properly. So this was an improvement, but an incremental one; we weren't any closer to a dynamic infrastructure.

So: still got a monolith, got some services running, we're registering them now, we know where they're running, we know if they're healthy. That's pretty good. What could we do next? We knew we were moving toward orchestration at some point; you know we were moving toward it because there's a box that says so. We wanted to leverage the service discovery we now had to actually route requests intelligently based on it. So dynamic routing was next.

This is what our routing looked like at that point. If the monolith wanted to make a call to a service, which was our most common case early on, it would make the request to an nginx proxy running on its own server. That nginx proxy would forward it to a VIP on a hardware load balancer in our data center, which had been manually configured to know where instances of the service were. Not great. The VIP would then send it to one of the servers, to an nginx proxy on that end. The job of that second nginx proxy makes a bit more sense: it's there to allow blue-green deployments. Our teams wanted to be able to have traffic going to their blue instance, deploy to green, and then ramp 1% of traffic over to get a sense of whether things were working. Keep in mind, we were rolling this out while 75 million members were using our site, and we were starting to move genuinely important functionality into these services. So 1% of traffic is enough to watch the logs and get a sense of whether things are healthy or not. We did this by having our deployment job update the nginx config and send nginx a signal to reload it, so we could adjust the weighting on the fly, safely.

You're probably looking at this and wondering why there's an nginx on the monolith's side as well. The reason is that those green arrows represent traffic going between hosts in our data center. Being Credit Karma, security is the first thing we think about at all times, and one of our requirements is that all traffic between hosts must be encrypted with TLS. That nginx is there to ensure that security, and for performance reasons. I don't think I mentioned earlier that, to make things more fun, our monolith is not just any monolith; it's a seven-year-old PHP monolith. PHP does not do a nice job of holding connections open, with the result that every service request would have to open a TLS connection, send the request, wait for the response, and tear it all down, over and over again. The latency was not acceptable; it was not working for us. So we put an nginx proxy on that side to maintain the TLS connections to the load balancers, and that got the job done. But it was another piece of essentially hard-coded infrastructure we were putting on our servers. So how do we get past this? How can we make use of Consul to do better, more flexible, and eventually automatable routing?
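Before moving on, a quick sketch of what that blue-green weighting amounted to in nginx terms; the addresses, ports, and weights here are illustrative. The deploy job rewrote the weights and then told nginx to reload its config, which it does gracefully without dropping in-flight requests:

```nginx
# Blue takes 99% of traffic while green ramps at 1% (illustrative values).
upstream foo_service {
    server 10.0.0.11:8080 weight=99;  # blue
    server 10.0.0.11:8081 weight=1;   # green
}

server {
    listen 8443 ssl;
    ssl_certificate     /etc/nginx/certs/host.crt;
    ssl_certificate_key /etc/nginx/certs/host.key;
    location / {
        proxy_pass http://foo_service;
    }
}
```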
And Linkerd entered the picture at that point. This is probably the only audience where I don't have to explain what a service mesh is, because everybody's been hearing about it for the last several days. In our case, Linkerd gave us a service mesh deployed onto all of our servers, talking to Consul, so it knows where things are. This removed the need for the manual VIP load balancer configuration, and it removed the need for that nginx, because Linkerd does the TLS connections between all of the hosts for us. And it got us ready for actual orchestration, because now when we deploy a container somewhere, it registers in service discovery and something makes use of that to make sure requests reach it.

So at this point, routing looks like this instead, which is a whole lot cleaner and a whole lot smarter. The monolith sends a request to the Linkerd on its host. That Linkerd looks at the request and says: you want service foo? Great, I know where those are. I'll pick one and send your request to the Linkerd on its host. And that Linkerd, at its end, sends it to either blue or green. We modified our deployment job to do the weighting between blue and green through Linkerd's namerd component, so we can use an API rather than touching configs and telling nginx to reload, which is also much, much nicer.

So we're making some progress here. We've still got our monolith, but we've got a bunch of services, we've got service discovery in place, and we've got dynamic routing making use of it. So we could start to look at actual orchestration.

We had actually started to explore orchestration late in 2016, just to get an idea of where we'd be headed in the future, in case it might influence some of our decisions along the way. At that time, the landscape was fairly different than it is now. We looked at Mesos; we had people with experience running Mesos, but we felt that Marathon and Aurora for container orchestration were not really keeping up. They weren't evolving much, and it didn't look like we could count on them continuing to move forward the way we wanted. We looked at Nomad from HashiCorp, which was actually appealing: very simple, very small, easy to manage, a small set of functionality, but probably enough for us at the time. But it was very new, and we couldn't find good stories of people running it at scale, so we weren't confident enough. We looked at Docker Swarm, and oddly enough it was kind of the same story: not a lot of production usage at scale, and we were also a little leery of putting all of our eggs in one basket. rkt containers looked pretty interesting too; we weren't sure where things might be going. So Kubernetes at the time was also kind of an open question. It was moving super fast, but where was it going? It was hard to keep up with, and we weren't quite sure.

Fast forward just six months, to early or mid spring of this year, and the landscape had changed. Thankfully for us, the decision was super obvious, and since we're all here, we know what the decision was: Kubernetes. It was a very easy decision at that point. For us, the main deciders were, number one, the community, which, as we all know from being here, is ridiculously active and excited and moving things forward, which is awesome. And number two, the abstractions that Kubernetes provides were perfect for us.
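To illustrate what that means in practice, here's the minimal shape of a service expressed in Kubernetes terms; the names, image, and ports are made up, and this uses current apps/v1 syntax rather than whatever we would have written at the time:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: foo-service-green        # one Deployment per color for blue-green
spec:
  replicas: 3
  selector:
    matchLabels: {app: foo-service, track: green}
  template:
    metadata:
      labels: {app: foo-service, track: green}
    spec:
      containers:
      - name: foo-service
        image: registry.internal/foo-service:1.2.3
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: foo-service
spec:
  selector: {app: foo-service}   # fronts both blue and green tracks
  ports:
  - port: 80
    targetPort: 8080
```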
Everything from pods and DaemonSets to deployments and services mapped beautifully onto what we wanted to do, and onto what we'd been trying to do on our own in a half-baked way. And finally, by the spring it was pretty easy to talk to people and find Kubernetes installations running at scale, so we felt confident that it was going to be able to move forward and grow with us the way we expected.

But having decided on Kubernetes, we had to figure out how to install the thing. And this was interesting, because you run Minikube and you're like, great, this is awesome, let's go to production. How do I do that? To make it more fun, we run on bare metal in our data centers, but we also run workloads in multiple clouds, so we needed something that would run pretty much everywhere. That ruled out kops: no bare metal support there. kubeadm is coming along now, but at the time it really wasn't ready; it wouldn't give us a full HA configuration, an etcd cluster, all the things we needed to be production ready. Tectonic was pretty cool, but we can't run CoreOS in our data center for various esoteric reasons. We looked through dozens of options; we created a spreadsheet of installers, trying to figure out the feature sets and what would do what. It was kind of crazy. But we ended up choosing Kismatic, which had at the time just been adopted by Apprenda: a small but active community, some support, good documentation. And most importantly, it did what we needed it to do. It would run on bare metal, it would run in the cloud, and it was pretty flexible to configure. We could use the same tool set to stand up our bare metal cluster with a full HA configuration, and then stand up a mini cluster in the cloud really easily for testing, which was awfully nice.

So now we've got a cluster in production, and we're ready to start using it. How do we do that? Because we're not going to tell the teams, hey, can you stop developing for a couple of weeks while we pick up your services, move them over into Kubernetes, and change all the tooling too, so that it's talking to Kubernetes and all of this. That was not going to work. So we were really back to our initial baby steps: pick a service that's not in the critical path and work with the team to move it over. And that was actually the easy part. We talked to a bunch of teams and said, hey, could we work with you to move your service as the first one into Kubernetes? And they were all like, pick me, pick me, because they could see it was going to make life so much better. So that was really nice.

The main question we had to solve, though, was this: as we move some of our services into Kubernetes while continuing to run our existing cluster, how do we manage the routing? We've got services calling each other in and out of both. A service mesh is great if it's going to run everywhere, but is it going to run everywhere? And we lucked out in a big way here. I'd love to say we planned this ahead and knew it was coming, but no, we didn't. We lucked out because Linkerd happened to have the capability to do this once we started digging in. Linkerd's namerd component is what talks to service discovery, and Linkerd can talk to multiple namerds. So what we ended up with was a setup that looked like this, where our Docker cluster has a namerd talking to Consul.
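As a rough sketch of what that side of the picture can look like in config terms: a Linkerd 1.x router pointed at a namerd is shaped roughly like this. The host names, ports, namespace, and cert path are all made up, and the real configs had more in them:

```yaml
# Linkerd 1.x (illustrative): route HTTP traffic via a namerd that is backed by Consul.
routers:
- protocol: http
  interpreter:
    kind: io.l5d.namerd
    dst: /$/inet/namerd.docker-cluster.internal/4100
    namespace: default
  servers:
  - ip: 0.0.0.0
    port: 4140
  client:
    tls:
      commonName: linkerd
      trustCerts:
      - /etc/linkerd/ca.pem   # host-to-host traffic stays TLS
```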
Our Kubernetes cluster has a namerd talking to Kubernetes. The Linkerds everywhere talk to both namerds, so they know about the union of all of the services; they know where they all are. So if a service in our Docker cluster wants to call one in the Kubernetes cluster, it sends the request to its local Linkerd. That Linkerd looks at it and says: oh yeah, you're looking for that service; I know there are a bunch of those over there, and they happen to be in Kubernetes. I'll pick one and send the request to the Linkerd on its worker node. That's interesting, because it means Linkerd needs to be a DaemonSet exposed on a node port: it needs to be known and addressable from outside the cluster, because the request isn't going straight to the pod, obviously. So the request gets to that Linkerd, that Linkerd does its thing and sends it to the blue or the green instance of the service, just like before. Going in the other direction is essentially the same thing: the Kubernetes service sends a request to its Linkerd, that Linkerd looks at it and says, oh, you're looking for that service over there, that's in the old Docker cluster, but I'll get your request there. It picks an instance, sends the request to the Linkerd on that server, and it gets to the service. And this works.

So at this point we're done, right? Not really, but we're making real progress. We've got orchestration, we're in the process of migrating, and we can see that we're finally moving into the 21st century as far as running services goes.

There's one other element of this I want to mention, because it's an interesting and important one: configuration of the services themselves. In our old Docker cluster, what we did was leverage our old friend Salt again. The team has a config file for their service, it's in Git, and when they deploy, Salt pulls that config file out of Git and provisions it onto the server. As part of the supervisord docker run command, that config file gets mounted into the container, and the service has its config. That's pretty horrible. So we took advantage of the move to Kubernetes to get things into a better state, where configuration data for services now lives in Consul, because we've already got Consul out there and we know how to run it, and the secrets, database passwords, API tokens, and so forth, are in HashiCorp Vault. When services start up, they pull their config from Consul, they pull their secrets from Vault, and off they go. Getting services that start in Kubernetes the token they need to talk to Vault is an interesting problem; it's getting better with the latest versions of both Vault and Kubernetes, but it does work.
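A minimal sketch of that startup pattern, in the flavor of our Scala services; the service name, KV paths, Vault address, and the assumption that the Vault token is already sitting in the environment are all illustrative:

```scala
import java.net.{HttpURLConnection, URL}
import scala.io.Source

object BootstrapConfig {
  // Tiny HTTP GET helper; a real service would use a proper client with retries and timeouts.
  private def get(url: String, headers: Map[String, String] = Map.empty): String = {
    val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
    headers.foreach { case (k, v) => conn.setRequestProperty(k, v) }
    try Source.fromInputStream(conn.getInputStream).mkString
    finally conn.disconnect()
  }

  def main(args: Array[String]): Unit = {
    // Plain configuration lives in Consul's KV store; ?raw returns just the stored value.
    val dbHost = get("http://localhost:8500/v1/kv/services/foo-service/db_host?raw")

    // Secrets come from Vault; in this sketch the token simply arrives via the environment.
    val vaultToken = sys.env("VAULT_TOKEN")
    val dbCreds = get(
      "http://vault.internal:8200/v1/secret/services/foo-service/db",
      Map("X-Vault-Token" -> vaultToken)
    )

    println(s"db host: $dbHost")
    println(s"vault returned ${dbCreds.length} bytes of secret JSON")
  }
}
```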
So in the process of doing all of this, we obviously learned a ton, and some of the high points were these. Number one, we lucked out in our decision to start small and simple, for sure. If we had gone into this by taking the most important piece of our monolith and trying to chip it off, and setting up an orchestration framework and trying to get that working, it would have been six months before we made any progress or learned anything, and we would have made the wrong decisions along the way, without question.

Secondly, we really underestimated the effort to integrate our existing tool set with Kubernetes, and for that matter with Linkerd and all of these other pieces as well. Moving along all of these steps, we were uprooting a bit of the development experience at each step along the way. Modifying the deployment tooling as you go from weird Salt-and-supervisord stuff into Kubernetes is a big thing, and it impacts all of the environments, from development to testing to production. One of the things I mentioned earlier is that we were trying to get to a state where the environments were really the same, where the experience of running your service was the same regardless of the environment you're in. Naturally, Kubernetes being configuration based and declarative made that much, much easier. We actually had a slide about this, but I had to take it out for time: we leveraged Helm for it, which was terrific. Helm lets the developers provide just a tiny little config file, and we layer on all of the environment-specific stuff depending on where it's being deployed to. It just works, and they don't need to worry about all of those nasty details. So that was a big win, but it took much, much longer than we expected to make all of the tooling work with all of this.

I already mentioned our surprise at finding we had dozens of installers to choose from and had to figure out which one was going to do the job. It feels to me like this is an opportunity for the community to rally behind one, make it the installer to rule them all, and help make onboarding for people starting out with Kubernetes much, much easier. I think that would be a great thing.

And last, but definitely not least, is security. Security touched all of these things as we went along, and as you move into distributed services and start looking at things like service meshes, security gets very interesting. The way we set up the service mesh, with one Linkerd instance per server, basically means that every server can talk to every other server, and you just have to be okay with that, if you're going to be okay with that. Alternatively, you can obviously run one service mesh instance per pod in Kubernetes. That wouldn't help us outside Kubernetes, but it would give us the better path, which is really where we're going to be moving to, where you can lock down access between pods somewhat easily. And lastly, running Kubernetes involves something like a dozen Docker images, right? For us, that meant pulling all of those down, putting them in our private registry, scanning them, getting them approved, making sure security was cool with them, and moving on from there, which makes upgrades tons of fun.

So we're still moving through this. We've still got our monolith; we've accepted that we're going to have it for a while, but it's getting smaller rather than bigger, and we've got more and more services going. By the end of this year, sorry, by the end of 2018, we'll probably have a couple of hundred services the way it's going. We've slowly moved our way through the tooling, understanding along the way what we're trying to get out of it, and we feel pretty good about where we're going. Tools like Kubernetes, Linkerd, and so forth have made it much, much easier. So hopefully some of what we've gone through, some of our thoughts and decisions, may be useful for folks who are tackling this now. Thank you very much. It looks like I have a little bit of time for questions if anybody has any.

Yeah, the question is about how we're using Helm. We've standardized all of this so that every service runs the same way, and the developers don't have to think about Deployments, Services, or any of those objects, really.
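To give a flavor of that, though these keys are made up rather than taken from our actual chart: a team owns a little values file along these lines, and the deploy tooling layers environment-specific values on top:

```yaml
# values.yaml owned by the service team (illustrative keys)
name: foo-service
replicas: 3
image:
  tag: 1.2.3
jvm:
  maxHeap: 2g   # override hook for Scala/JVM services
```

At deploy time, something like `helm upgrade --install foo-service ./service-chart -f values.yaml -f env/production.yaml` applies the environment layer, with the later file taking precedence.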
All they know is that they've got a service, it's called this, and they need X number of instances of it running; for the most part, that's all they have to change. We run services in both Scala and Node at this point, so we do allow overrides for things like max heap size for Scala services, just because the JVM is annoying. They can put in some of those, but it's still a very small file they have to think about, and then we layer on everything else at deployment time based on where they're deploying to: if it's going to testing it gets one set of values, if it's going to production another. We also have multiple data centers that we deploy to in parallel, and we take care of abstracting that so they don't really have to think about it.

So the question is whether we saw performance degradation using Linkerd. We didn't, actually. Now, that said, when we first adopted it we were pretty primitive in our metrics, so it was a little bit of a best guess, like, yeah, this feels pretty good. But we've done measurements since then, and it's really pretty minimal, very acceptable for us. And no hardware load balancers anymore, which is really, really nice, because they actually do have APIs, but nobody quite knows how to use them. Anything else?

Yeah, so the question is whether these services were pieces of the monolith or new functionality. We purposely started with new functionality, because we knew that trying to take existing code and put it into this new thing was going to be more difficult. Ironically, we had also, as a company, started to adopt Scala a bit, and we were building frameworks for these services. We asked all of the developers: we're thinking about building the framework only for Scala and abandoning PHP, how do you feel about that? Maybe not surprisingly, everybody said, yeah, we don't want to do PHP anymore, Scala's great. So it was all new stuff, all written in Scala, and as we've been carving pieces out of the monolith and replacing them, we've actually been wholesale rewriting them in Scala. So we really have no PHP services running in production there.

Yeah, so the culture question. I was surprised, actually, that it was fairly easy. We have absolutely gone for the you-build-it, you-own-it mentality, and there's been very little resistance to that, because the trade-offs of not having to deal with the monolith are well worth it in everybody's minds. We've obviously had to train people on PagerDuty, on-call response, and all of that, but the harder thing, the thing that enables the culture, has been providing the tools teams need to manage their services properly. Again, going back to security: almost nobody at our company has access to the production servers. There is no SSH to production for developers, ever. So we really had to approach this from the standpoint of making sure they have their logs, their metrics, their dashboards, their alerting capabilities, and the ability to really go in and understand what's happening when their service starts to act up. I'd love to say we've got that really well solved, but we're still evolving it, and probably will be for a while, because that tooling can never be too good.

So the question is what we use for deployment. We have a custom-built tool, actually, and again that goes back to the security reasons.
We've looked at all the various tools out in the sponsor area for doing Kubernetes deployments, and really none of them work for us because of the security access problems, or challenges, I should say. So we built our own. It's kind of a combination of Jenkins and Ansible, and everything flows through one point that's very carefully monitored; we log every single action it takes. So yeah, we had to build our own that way. It did get much, much nicer with Kubernetes, that's for sure, because we wrote a little Go utility that we just called deploy tool, because we're imaginative, and it's responsible for taking a command line and calling the Kubernetes API to do what we need. That also limits access: you can only do what the tool will let you do. So from a security standpoint, it locks things down a little bit more as well.

Yeah, the development environment, that's still in flux right now, honestly, because we've gone from a day when everybody could just run a copy of the system on their laptop, in a VM actually, is how we did it. Now, with all of these services, it's too big to run on a laptop. So, like everybody else I've talked to, we've had to figure out how you do development with only a minimal set of the system available. We're pretty much saying do that as you like, and providing the tools to make it fairly easy, and then figuring out which services we can provide in a shared way up in the cloud, so everybody uses one instance of those. That only works for some services, the ones that essentially work in a multi-tenant way, because otherwise, as you know, someone makes a call and changes the data, then I make a call and it doesn't work anymore. So we're still evolving that a bit. From there, we figure developers are doing their unit testing, running in their IDE, getting their stuff working, and then we made it very easy to essentially push a button and spin up a tiny cluster, basically a one-node cluster for now, in the cloud, and get a URL to hit so they can do integration testing with everything available.

Data migrations, as we chip pieces off the monolith: yeah, it's painful. We've done it a couple of different ways, primarily depending on the size of the data. Sometimes it's just too much to do during a maintenance window; in theory we could deploy the new service, get it ready, wait for a maintenance window, do a data migration, turn it on, and hope it works. But we've also done a kind of rolling deployment of the service, where initially it's reading from the old table and double-writing into a new database, because we have definitely gone for the rule that no database is ever used by more than one code base, since otherwise you've just got a different-looking monolith. So there are definitely cases where we've had to give it a couple of weeks of running until most of the data has been double-written and is in both places, then do one last update during a maintenance window, and then we're good and we kill the old table. But yeah, it's painful sometimes.

All right, I think we're out of time. Thank you very much.