Hey, testing, okay. How's everyone doing? Awesome. Is everyone excited to be here? Yeah. You guys been enjoying the conference so far? All right. Awesome. What's your favorite talk so far? This one. Correct answer. Yes. There was only one right answer to that question. Anyone else from Boston here? Oh, awesome. Even better. That's fantastic. Greatest city in this country. Yeah. Yeah, I mean, you know, actually it's pretty cold here. Yeah. Yeah, I thought, yeah, I did not pack appropriately. Yeah, it's supposed to get warmer. Yeah. Well, you know, come to Boston, you can have cold, wet, and snow all on the same day. Yeah. Yeah, sometimes. Yeah. Okay. I guess we're ready to get going. So hello everyone. My name is Raffy Schloeming. I'm the CTO and chief architect of a company called DataWire. I'm Phil Lombardi and I'm a platform engineer at DataWire. I've been working with Raffy since late 2015. Yes, had to. So at DataWire, we build tooling for doing microservices on top of Kubernetes and Envoy. And what we're going to talk about today is how to go about using Kubernetes and Envoy to get a whole lot of developers working together and building awesome stuff. We call this bottom-up approach to microservices service-oriented development. I'm going to talk about why we think this is one of the most effective perspectives from which to benefit from microservices. And Phil's going to talk about the platform we ended up building for ourselves to actually do service-oriented development for the services we were building. And of course, my keyboard is frozen. So first, I want to start with a quick poll, though. Show of hands. How many of you have asked these sorts of questions about microservices? Stuff like, how do I break up my monolith? How do I architect my app? What kind of tooling and platform do I need to actually benefit? OK. All right, that's a good number.
So these were the exact same questions we had when we started out, and we kind of learned the hard way that these weren't really the right questions to be asking when you start out. So what is the right question? Well, I'm going to talk about that. But first, I'm going to cover a little bit of our early history so that you can understand the painful lessons that go along with this question. So we were founded in about 2014. My co-founder and I were both sort of tool builders. I had a big distributed systems background. So we wanted to build a company that helps people build stuff. Microservices were the way people were building things. And so it was pretty natural that we wanted to help people build stuff with microservices. Of course, we had a lot to learn. So we started out. We dove right in. We spoke to all the big guys who were doing microservices at the time: Netflix, Twitter, Yelp, Google, Facebook, and more. And we pretty deeply studied the architecture and tech stacks that they were using. And pretty quickly, this picture of an emergent architecture formed. It looked something like this: a control plane and a data plane. The control plane does a bunch of stuff like discovery, logging, tracing, and metrics. The data plane basically consists of a bunch of smart endpoints with resilience for L7 network stability, things like timeouts, rate limiting, circuit breakers, all that good stuff. And so we looked at this and said, well, that's a lot of technology to build just to benefit from microservices. So we can help people by building some of this for them. So that's what we set out to build: sort of a control plane in the cloud. We used microservices because it was already decomposed into a bunch of separate services, so it seemed like a pretty natural fit. And things went pretty quickly to start with. We actually built a prototype and released it pretty quickly. But right after we launched, things started slowing down.
And so we kind of took a step back, scratched our heads, and tried to figure out why things were going slow for us. And as is the case with this sort of thing, when you're in the middle of things, it's not necessarily obvious, even though it becomes clear later with hindsight. So we went through a couple of releases. We tried a bunch of different things. We picked a bunch of different tools to try to help accelerate things. We went through a bunch of different deployment systems. We took a look at our architecture, because we thought maybe we had decomposed the problem wrong, and we tried a bunch of different options there. None of that really made a difference for us. And after going through a couple of releases and looking back pretty carefully, asking a bunch of questions, we finally figured out what was going on. Every feature actually required pretty carefully coordinating the efforts of multiple different people. Some of this was for bad reasons, and we fixed those. Some of this was for very good reasons: we didn't want to break our users. And so in the end, we decided it wasn't really our technology or our architecture that was slowing us down. It was actually our process. Now at the same time, our prospects were actually moving slow as well. We had a lot of them. We were talking to 30 or 40 different companies. They really liked us. They kept on coming back to talk to us because we had a lot of valuable information for them. That's not supposed to happen. So yeah, they really liked us. And they kept on coming back to talk to us, so we really could track their progress over a pretty long span. And they were all at different stages of migrating to microservices. And we started to notice a pattern with them. The slow movers kind of fell into two camps. Companies that were doing extensive tech bake-offs, very carefully choosing their technology.
And other companies that were very, very carefully re-architecting their monolith, sort of in a refactoring process, to get to microservices. And then there were the fast movers. And we really wanted to understand what was different about the fast movers. So we asked them a bunch of things. What really stuck out was the story of their first service. All of their first services were different in some ways. There was a file upload service that was created because of the occasional multi-gigabyte file that would take down a Rails monolith. Another case was wanting to isolate the impact of audit requirements for PII. And another company just wanted to adopt modern CI/CD tools and didn't have time to retrofit their monolith. So they started building new features in separate services with more modern tooling. But the thing that all of these stories had in common was that they all had an urgent need that could not be addressed quickly enough in the context of their existing process. So we put two and two together with our own experiences and said velocity actually comes from process, not architecture. And this makes a lot of sense if you think about it. You don't stand in front of a whiteboard and architect a Death Star with thousands and thousands of services. You get there by enabling a different way of working that, as a byproduct, makes it way faster and easier to churn out hundreds of services. And this is when we ditched service-oriented architecture, fine-grained or otherwise, and started thinking about service-oriented development. And to understand this, it helps to recognize two things that dramatically impact the way we work as developers. First of all, in software, stability versus velocity is a pretty fundamental trade-off. The faster you go, the more things break. And this is why things slowed down for us after we launched. When we were prototyping, we could break pretty much whatever we wanted whenever we wanted. Then we had users.
We adopted a whole bunch of practices which slowed us down for good reason. The more users you have that rely on you, or even just a few that rely on you for something really important, the more careful you need to be, and the slower you can move. And what this means is that if you're trying to quickly add features while maintaining stability, there really is no Goldilocks point on this curve for you. And that's because a single process is inefficient. It forces a single stability versus velocity trade-off. And there's another important factor to recognize. The development process involves a bunch of distinct activities that I'm sure you're all familiar with. And something that can be harder to notice when you're small is that you can really only do one of the activities in this process at a time. Yet when organizations try to scale, they seem to do it by building specialized teams associated with some or all of these activities. And as an organization like this grows, this turns into some combination of dramatic underutilization and departments fighting each other. If velocity is a priority, then operations will get frustrated with developers for breaking things all the time. If stability is a priority, then operations will ultimately end up putting processes in place that will slow down development. And when development slows down, product management gets frustrated, and if your leadership doesn't understand this, they can end up setting up a combination of goals and organizational structures that literally pits an organization against itself. What this means is a single process doesn't scale. This is why instead of asking how to break up a monolith, we like to ask, how do I break up my process? This is the question to ask if you want to go faster. And this is the perspective to approach microservices from. Because microservices let you have as many different processes as you like. You can think of microservices as a distributed development workflow.
And this lets you customize the process for each of your teams to that ideal stability versus velocity trade-off for whatever service they are working on. And when you do this, that's how you can move fast and keep things stable and get benefits almost immediately. You can have multiple simultaneous processes, including your existing monolithic process, all tuned for that ideal stability versus velocity trade-off, with all your faster-moving services leveraging the value you've already built in your more stable services without actually disrupting them. So this sounds great, right? So how do we get started with this? Starting from the right principles makes this a whole lot easier. But it is still a big shift in how people operate. This requires both organizational and technical changes. Organizational changes really impact the technology and how you use it. So I'm going to cover some of this, and some ways you can make this easier on both fronts. So first of all, the organizational factors you need to consider. You really have to give a lot in order to get. You need a big emphasis on education, communication, and delegation. And if you look at the picture on the right, you can see why. On each of our small teams, our former specialists are exposed to every aspect of the development cycle. And so there's now a big learning curve. And also, because they're all specialists, nobody speaks the same language, so communication can be a challenge. And with this model, you end up delegating bigger parts of your business to much smaller teams. And that's sort of the point of this. But it can also be quite scary. But when you do this right, you get a whole lot. You get better holistic systems. With learning comes personal growth and much higher job satisfaction. With better communication, that conflict that existed in your organization before turns into collaboration. And with delegation, of course, you get massive organizational scale.
And the benefits when this is done well, they're pretty hard to overstate. So what's the best way to actually go about doing this? Well, you need to create self-sufficient autonomous software teams. Why self-sufficiency and autonomy? Well, if you have to work directly with other teams to get stuff done, then their process is interjected into yours. If you have autonomy, you can choose the best process to actually meet your goals. So to get there, there's two things to be aware of. First, you need to be aware of centralized specialist functions and try to eliminate them. Try to avoid centralized architecture teams and centralized operations teams. And don't get confused between a platform team that provides a platform and tooling as a service and a centralized operations team that's responsible for keeping all of your services running. These are two very, very different things. The second thing is to think of your teams like their own little spin-off companies. You probably already consume external services like Stripe and Twilio. Well, you can think of your microservice teams the same way. Pick a real urgent business problem that you wish you could buy instead of build, and then form an internal spin-off to build it. This sets up the right mindset for a lot of things. It helps you define your service because you're thinking mission statement. Who's the user? What are you trying to help them do? This is a way more effective way of defining services than big upfront design. It also helps with communication. You set up a customer relationship versus a co-worker relationship, and it helps you form the right team because you put people with all the different specialties you need on one small team. You can even do this with no fancy tooling or technology at all, but tooling can help a whole lot. That brings us to the technical implementation. Now, if you recall from earlier, this is sort of the rough picture of what most microservices tech stacks look like. 
It turns out this picture is actually missing something that's really, really important. And that's the people. People are the top-most layer of the control plane. When a service breaks, a person has to find out about it and fix it. And for a service to improve, a person needs to change it, hopefully for the better, and the people factors involved have a big impact on the tooling as well. In fact, this gives us the goals for our tooling. We need to build tools that make our small autonomous teams productive. And right away, this tells us a few things. We need our tools to be good for the generalists we're trying to create. And a generalist UX is very, very different from a specialist UX. Think about all the specialist tooling out there for operations, things like monitoring, logging, analytics, deployment, all the L7 proxies and routers you can choose from. All of these tools have enormous surface area and flexibility, which is exactly what you want as a specialist. Specialists prioritize something general enough for any possible application a developer might conceivably write. And they don't care about the learning curve. Generalists, on the other hand, want something that's an easy starting point that works out of the box for their particular application, and that then lets them discover the more advanced features if and when they run into the need for them. Now, the other thing this tells us is that our teams are learning, so we need our tools to actually help with that. This means building on familiar concepts and providing training wheels: safe defaults so we don't get into trouble, and great feedback. Every error message should really have a pointer to whatever it is you need to learn to solve the problem. And finally, because our teams are autonomous and each one can choose their own process, our tools need to be flexible enough to fit many different processes. So this is sort of the platonic ideal platform for doing service-oriented development.
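The two "training wheels" ideas above, safe defaults and error messages that point at what to learn next, can be sketched in a few lines of Python. This is a minimal illustration and not any actual DataWire tool: the names DeployConfig, GuidedError, and validate, the default values, and the documentation URL are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployConfig:
    """Hypothetical deploy settings; every field has a conservative default."""
    replicas: int = 1              # start small unless the team asks for more
    timeout_seconds: float = 5.0   # fail fast rather than hang
    expose_publicly: bool = False  # nothing is public unless requested


class GuidedError(Exception):
    """An error that names the problem and points at what to read next."""

    def __init__(self, problem: str, learn_more: str):
        super().__init__(f"{problem}\nLearn more: {learn_more}")
        self.problem = problem
        self.learn_more = learn_more


def validate(config: DeployConfig) -> DeployConfig:
    """Return the config unchanged, or raise a GuidedError with a pointer."""
    if config.replicas < 1:
        raise GuidedError(
            "replicas must be at least 1",
            "https://docs.example.invalid/deploy#replicas",  # placeholder URL
        )
    return config
```

A generalist can call validate(DeployConfig()) without knowing any of the knobs, and only meets the more advanced settings, with a link explaining them, when something actually goes wrong.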
And we've even managed to actually accomplish some of this. And Phil is actually going to talk now about what it is that we've built, what's worked well for us, and what hasn't. Yes. Thanks. So it may be nice to believe that Raffy has just come up with all these things on his own by sitting in an armchair and drinking scotch and going, man, I can tell all these people in this audience about this next week. But it's actually a bunch of hard-learned lessons over the course of two years and some change from when we started building the backend at DataWire. So a little context: I joined DataWire at the end of 2015 as pretty much the first backend engineer to join the company. And at the time Raffy was working in Boston. I was working in Boston. We had two other engineers, one down in New York and one in Europe at the time. They were kind of working on some other stuff, and I came in to start building the backend. And when we started building the backend, we knew we were going to need to build pretty fast for what we were trying to do, and also that we were going to have to build more than just a handful of services, which we were going to need to put all the engineers on eventually. Now, when I came in there was no real expertise in the company for a platform engineer or an operations engineer, but if you're going to maintain an actual production service, or you're going to actually try and build something that needs to be running in the cloud, you find quickly you do need someone to do that. So, we're out of luck. Anyways, the good news for DataWire was that I had done some of that kind of stuff in my previous job. I was basically working as a platform engineer slash developer tooling and experience engineer at the previous gig. But once again, problem. I wasn't hired to actually work as the platform engineer or developer experience engineer at DataWire. I was hired to build the product in the backend.
So, I had to come up with something that would be relatively easy for the people we were going to bring into the company, and also the existing engineers, to just pick up and go, so that they didn't have to bother me all the time with their problems. So, I kind of went, all right, well, the only way to really do this is to build self-service automation around all this stuff. And so, I was like, all right, we're going to build out a self-service platform that anyone at the company can use. It'll be documented well. There'll be instructions to get a service started. Here are the tools they're going to be using, and there are going to be these various URLs they go to. They can put things in here and they'll build out the CI pipeline, they'll build out the deployment pipeline, all that jazz. And so, there were some design considerations along with that that I had to keep in mind. DataWire has always been a polyglot programming shop. You know, Raffy and the other engineers at the time were working mostly in Python. I came on board as a backend engineer with JVM experience, so I was writing Java stuff. We've always kind of had an openness about picking whatever tool works for you. So, you know, JavaScript tends to just creep into anyone's code base despite everyone's best attempts to avoid it. Go has become really popular, and stuff like that. With any set of programming languages, you end up having a lot of tool chains, so not everyone is going to be using the same tool chain. If you're a Java developer, you're obviously familiar with, you know, Maven or Gradle, but Python folks don't use those tools at all and laugh at you if you suggest any kind of idea of using that stuff.
The engineering org started out as distributed by default, and while we're now all co-located together, we still kind of follow a distributed team pattern. Some engineers come in early, some engineers work late, some engineers work the morning, take the afternoon off, and start working again late in the evening. So we have what would be a co-located team with a distributed engineering personality, basically. When I came in, I was really the only person with a strong services development and microservices development background, so we had to have a way that anyone who was going to start working on this could pick it up easily enough despite not having that background. And because we were building the actual backend for the product, there's obviously the concern that this is all mission-critical software. It has to work, it can't fail, it needs to be debuggable, logging has to be available, all that jazz. So what did I do? The way that I went about doing this was I wanted to do it right, because that's what everyone always says when they start out a technical project: I'm going to do it right. And so I started architecting the shit out of it with tools. I threw in things like Jenkins and Spinnaker and Terraform and Ansible and EC2 and everything else you can think of that's popular under the sun for doing this kind of stuff. And, you know, I'm not saying these are bad tools. They're not bad tools. In fact, they're all really good tools in the right context, and that's what I'm saying: I got the context wrong. I put all these tools together, glued them together, wrote up some documentation around it, and thought I was going to get the... uh oh, my GIF didn't load. Oh no. Oh, there we go. So I thought I was going to get that, and it turns out, no, I got something more like this. So this is not my actual co-worker, but he looks sort of like this, with long hair, and he was looking at all these tools and going, oh, what is this?
This is insanity. You've got me looking at three different documentation sets, jumping across different programming languages for different configuration sets, and, like, what is this? Insanity, insanity, insanity. So all through 2015 and 2016, like Raffy said, prototyping started out quickly, and then things got slow, and they got really slow, and change became really slow and painful as people had to start bringing more of this tool chain into their own tool chain and trying to adapt and use it to actually deploy software. And I, you know, put my fingers in my ears and kind of said, no, no, no, it's not my fault, it's your fault, you're not reading the documentation right, or you're not doing it correctly. Which is completely the wrong response. But when I finally sat down and started to look at the metrics for how slow we were going versus what we should actually be able to do, it was like, ooh, this is really bad, we've got an issue here. So we basically had to have a reset, and fortunately we had some time at the end of 2016 that I set aside to take a look at everything. We were able to step back and go, hmm, maybe this is not correct. So we took a step back, looked at it, and asked, what do we want to do? What we really learned was that none of this is about tools. You can throw as many tools as you want at these problems and you're never going to fix people and process, because people and process are actually the most important thing. And you're not really trying to fix them. When you say fixing, you tend to imply they're broken. They're not broken. They're just doing things in the way they want to, the way that makes them the most productive. At the end of the day, people are trying to be productive. They want to get their job done so that everyone else is happy, the business is happy, and it keeps going. But you can't just throw tools at it, and the expectation I had started with was, let's throw tools at the
thing. Engineers are really finicky. They don't like picking up tools that they don't believe in. If they don't believe the tool is going to work for them, and it makes their life harder, they will not like that tool. And so I started putting tools in their way that were slowing them down, even though they knew what they were trying to do, and they got angry and they got grumpy and they complained, and it was a bad experience. So all the tooling in the world is not going to make your problems go away. You've got to think about the people and the process. Finally, there's another really important point with tools: every tool basically has a cost to it. Every time you add a new tool into your deployment pipeline or into your dev workflow, you're actually asking for n permutations to come out of how people are going to use it. The docs say you're going to use it this particular way, and you're going to find no one uses it that particular way. Three developers are going to come up with three or six or 20 different ways of using it across all your projects, and you're going to have this explosion of combinations of tools and configurations and usages that don't actually match what the documentation says. And more importantly, READMEs: they're just bad. A couple of problems with READMEs. I know I do not write READMEs particularly well. I have an expert blind spot when it comes to putting this stuff together. I will jump from step one to step two and not realize that I have missed, you know, A, B, and C of step one, which leads to not having complete information. The next person then reads it and goes, well, I have no idea what's going on. And it leads to arguments in the office, where people say, well, I followed your README, and you didn't actually document what you want me to do here, so I don't have that information. That's one problem. Another problem: READMEs are often treated as opinion rather than as the
facts of the world. Think of it like baking versus cooking. In cooking, you can take a lot of liberties. You can look at a recipe and tweak it and do whatever you want to it. Developers tend to look at READMEs and go, I know step two already, so I'll do it my own way, because it will end up with the same result, I think. Whereas the person who wrote the README was actually treating it like a recipe for a cake, and we all know that when you start messing with recipes for cake, like adding baking soda to yield better flavor, you're not actually going to get what you expected. It's going to taste terrible, and you're going to have a lot of problems there. So we went back to the drawing board at the end of 2016, and we were like, okay, what do we actually want? We really want a way for everyone to have a common tool chain that can exist on their laptop, or on your co-worker's laptop, or on your friend's laptop, or your grandmother's laptop, or in the CI system. Wherever you put it, for whatever language, for whatever environment, it's going to be the same. And we said we're going to do this initially by starting with Docker as our primitive. We will package everything into Docker, and we will then figure out how to run the Docker stuff. Once you've got everything into Docker, the big question at the time became, how do we run this? I'm sure everyone in this room is familiar with things like Mesos and Swarm, but I had been following Kubernetes pretty closely since 2014, when I first heard about it, and by the time 2016 rolled around, I was like, okay, this looks like it's ready for prime time, this looks like where everyone's aligning. And so I said, all right, we're going to use Kubernetes. It's configuration driven, it's pretty simple in terms of what it's actually trying to do, and it has concepts that everyone will find pretty familiar. We'll deploy with Kubernetes. And so our Kubernetes journey began at this point in time. We started to take a look at the existing tools we had, we
scrapped most of them, and said we're going to build tools based on two things: either there's active pain in the process that you're trying to accomplish, or there is some common pattern we've noticed across all of our repositories and all of our projects that is worth encoding in a tool. An example of active pain that everyone feels: a Docker build process where you're going to build, say, a Python application is really quick, but it becomes a painful, painful process for Java developers who have to recompile code, because every Docker build effectively gets treated as if you were doing a clean build from scratch, without the ability to have a build cache for your Java compiler. So that's an active point of pain that we built tooling around. An example of a common pattern: we noticed everyone has Dockerfiles in the repository, and everyone started putting the Kubernetes manifests in a particular directory. So, all right, we'll make that standard. We'll make it so that whenever you put certain things in these directories, the tool will just know what to do, and it goes off and does the thing. So for anyone who really wants to go back to their own development workflow and start building like this, the first step you have to follow is: provide fast, repeatable builds for everyone. Figure out what you want to do for your workflow. For us, everyone works off of a Git code repository, and they create branches for the thing they're working on. They package not only the resulting artifact for what they're working on; the entire build tool chain is actually built into the Docker image, which got rid of the problem of having to deal with READMEs and having specific instructions for particular tool chains. So when you do an actual Docker build now and you're a Java developer, the whole Docker build process will go through, and it will end up
with, you know, doing the compilation and producing the image with the JAR built into it. Same step for a Python project or a JavaScript project or a Go project. They don't need to know the individual steps going on. The person who wrote the project and started it knows that, but people who are working on it don't need to know those individual steps. They're codified into the Docker build process. So it gets rid of that whole problem of, well, this was actually a baking recipe and you treated it like a cooking recipe. And when I said it had to work for everyone, I meant everyone. We made sure we weren't forgetting about people who were working with different tool chains. It's imperative that you have a tool that works across your entire organization for any type of situation, because everyone eventually becomes a polyglot shop, whether they believe it or not. Either through acquisition or through some other mechanism, you are going to end up with services written in languages that you didn't expect. So you need this process to be resilient to change in your organization as people come and go. So that's step one. Anyways, step two: you have to provide fast self-service deploys as well. At DataWire, when we're building code, we have the ability to deploy into our shared cluster either pre-commit or post-commit. That means when you're working on a branch, you can have an isolated environment for your code that is completely isolated from another person working on another branch. So Raffy will often be working on a particular feature of one of the projects, and I'll be working on another feature of that project, and we can both be running the same project in the same cluster, and neither of us will ever touch the other. But it will still be in the same cluster, so we only have to manage a single cluster in total. So: parallel, isolated deployments throughout your organization. So to
recap for steps one and two: codify your entire build process and put it into a Docker image. Make it so that it actually does the entire build to get what you want, and that's going to be your runnable artifact. Then you use Kubernetes to actually do the deployments. As we did these things, we started to codify them into tools. We've wrapped Docker build now in our own tool called Forge, which does the Docker build so that it can maintain compiler caches and stuff like that. It can also do things like shortcut building an image: if it knows that the image already exists, it doesn't bother doing that. It just looks at the commit history and goes, at this SHA the entire repository is the same, we're not going to bother building a new image for you. There's nothing magical at any point in time. When developers need to deviate from the logic we've prescribed, it's just a code change to the tool itself, and in the worst case they can just pull the tool out. We're not doing anything that fundamentally alters their development process. It's just the glue code that everyone ends up building no matter what. But there is really a third, and the most important, ingredient here. It's step three: you have to make it easy to reach the changed code. We do this with Envoy. We use Envoy as basically the entry point to our cluster, and we put in a thing called Ambassador, which listens to changes on annotations that we put on the service manifests that are applied into Kubernetes. So in a typical setup in a branch at DataWire for a particular service, they will annotate the service so that they get host-based routing that has the branch name in it. The request goes through Ambassador, which is an API gateway that handles things like auth, and then goes to Envoy, and Envoy routes it to the correct backend service. So it makes it super easy for someone to actually have something like dev awesomefeature.datawire.io or plimbardi.datawire.io, where they put whatever they want, and to spin up projects in
general, because you only have to think about this little block of annotation code that you put into your service manifest, and you're off to the races.
Is it perfect? No, it's really not perfect. We've still got a long way to go, but it's so much better than where we were. We really started putting this into effect earlier this year, and the speed at which we're able to build our services now is monumentally faster. It's hard to actually quantify how fast it is now compared to what it was. But we still have a long way to go, so there's a bunch of things we're actively working on going into 2018 to improve this. First, self-service bootstrapping. I shouldn't have put "self-service" there, since anyone can start, but we'd like to have some easy way for developers who come into the company to start services that doesn't involve taking an existing service and ripping the code out of it. We haven't really come up with a good cookie-cutter mechanism that we like. For me, I'm still continually hounded for self-service stateful infrastructure. I'm usually the only person who knows how to get that working, so I'd really like to figure out how people are going to deploy their own databases and their own queue servers and all that kind of stuff without having to bother me. And then monitoring and logging. We have all the infrastructure in place for collecting logs, and we have all the infrastructure in place for doing monitoring. It's mostly that we haven't come up with a good pattern and a good UX yet for everyone to be able to consume that data in a meaningful way. Anyways, Raffy here is going to take it away for the rest and give you a summary and conclude this.
Thank you. Yes. So looking back, after kind of reliving some of the past couple of years thanks to Phil, if I could give myself some advice from the future, I would really tell myself that Kubernetes and Envoy really do have the bulk of the capabilities you need to get
started with microservices, but capabilities really aren't enough. UX matters a lot, and I mean a lot. And when it comes to UX, the people and process factors fundamentally drive it. Building this ideal UX is actually an enormous undertaking, so it's really helpful to understand some of these principles and take an incremental approach with them in mind.
So that's our talk. Thanks for coming. We're happy to answer any questions now. If you want to talk to us more later, you can visit us at our booth, S58. If you're interested in using some of the tools we've built for your own platform, they are free and open source. We've published some hands-on guides at the link above. Send us a tweet or email us, and we're happy to chat.
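For readers following along afterwards, two of the ideas from the talk can be sketched in a few lines: skipping an image build when the commit has not changed, and deriving a per-branch hostname for host-based routing. This is only an illustrative sketch, not how Forge or Ambassador are actually implemented. The function names, the 12-character SHA prefix, the tag format, and the hostname sanitizing rules are all assumptions made for the example; only the datawire.io domain comes from the talk.

```python
import re
from typing import Iterable


def image_tag(service: str, commit_sha: str) -> str:
    """Hypothetical tag scheme: one image tag per service and commit."""
    return f"{service}:{commit_sha[:12]}"


def needs_build(service: str, commit_sha: str, existing_tags: Iterable[str]) -> bool:
    """Return False when an image for this exact commit is already known.

    Assumption: the tag alone identifies the repository state, so an
    existing tag means the build can be skipped.
    """
    return image_tag(service, commit_sha) not in set(existing_tags)


def branch_host(branch: str, domain: str = "datawire.io") -> str:
    """Turn a branch name into a hostname for host-based routing.

    Assumption: characters that are not valid in a DNS label are
    collapsed to hyphens, and the label is capped at 63 characters.
    """
    label = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
    return f"{label[:63]}.{domain}"
```

In this sketch, a build step would call `needs_build` before invoking the real image build, and a deploy step would put the result of `branch_host` into whatever routing annotation the gateway expects on the service manifest.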