I want to do something quite a bit different than probably every other talk here, and probably the majority of the talks you've ever been to before. Rather than focus on what I think you should do, or focus on teaching you new technology or explaining something, I'm going to do the opposite. I'm going to tell you everything that you should not do. And I'm going to tell you this story based on a lesson from personal experience, something that I did and worked on with a team. And we got a lot right, don't get me wrong, but we got a lot, a lot wrong. And I want to share with you today the story of what we got wrong, how we fixed it, what we learned, and hopefully a couple of takeaways on how to prevent you from making those same kinds of mistakes.

So this journey is going to be assisted by my cute little puppy, Ada Douglas. She's going to be in some of the interludes. But really I want to take you through the background: what the team was, what the technological and business problems we were trying to solve were, the architecture involved, what we chose and why we chose it, how we actually got this platform up and running from a local dev environment into production, dealing with the changes that eventually came from product and from bugs and things like that, the refactoring we needed to do, and, to finish it off, some of the lessons that we learned.

To give you some background: when I joined the team, it was nine developers. They had just recently grown from five developers in the year beforehand. The original five had been there for about four years and built the entire monolithic application that existed. And we grew to 23 by the time this story starts. We went from five to nine in about a year, and then nine to 23 in about nine months. We had recently closed a round of funding for aggressive growth. And so we had a decent-sized team. The team had a fair range of experience, everything from quite junior to some of the most senior people in the industry that we could find, very, very experienced developers. About half the team was PHP focused. A couple of those people had experience with Golang and Scala, et cetera. We had a number of front-end engineers, mostly focused on JavaScript. React was the new thing at the time, and we had hired a very, very good React team and built a React team under us. We also had data engineers, as well as DevOps, as well as test engineers. DevOps is kind of a loaded term: operational engineers, a couple of people focused on infrastructure, and an SRE to help actually run the existing platform that we had. So it was a very new team. The team had existed for less than three months as a whole before we kicked off this project.

The background: the legacy system. I mentioned we had a monolithic application. This monolithic application had bits of Zend Framework 1, CodeIgniter, Symfony, CakePHP, and a bunch of PHP 4 and PEAR code in it. And to be fair, it worked pretty well. It powered a multimillion-dollar-a-year business for years. But it was kind of a nightmare to work with. We had started doing quite a bit of refactoring. As you see with Symfony, we had started to refactor prior to me coming in, to try to condense onto one framework. But it was just such a large application, about half a million lines of code, that that's not an easy thing to do. The biggest problem, though, was at the data layer. The application, as you see here, had about 600 database tables. This is not a complicated application.
600 tables for something that you could probably do in about 40 should give you an idea of the level of complexity in here. And this 1,000 cron job number, that is not an exaggeration. I believe the exact number was something like 1,150. And the scariest part was that the application was designed over many, many years. Reporting was a really important area of the application that we shipped to our clients, and multiple parts of the application would write to the reporting area, the reporting tables, except they would all write slightly differently, partially because of bugs, partially because of differences in design over the years. And so the stats were wrong, because the database was updated in a weird way. And rather than fixing and commonizing and putting in a library, as most developers would do, the solution was: well, let's just install a cron job to fix this. And so we had cron jobs that ran every five minutes, every 20 minutes, every 15 minutes to clean and rectify the data. It brings a whole new perspective to the term eventually consistent. So this was actually a huge, huge problem and really slowed everything down. Working with this system was a nightmare.

And so when I joined, when I was hired, they were just ending a nine-month product freeze. What that means is that no product work happened at the company for nine months. They focused purely on refactoring, on moving the infrastructure away from the dedicated hosts they were on beforehand into AWS, on a couple of security things that had to be taken care of for compliance reasons, and on trying to make this system easier to work with. At the same time, sales did not stop. So we had ever-increasing scale requirements. We had new customers closing with 10 times the number of users of our prior biggest customer, and the product wasn't ready to scale for that. And so we needed to ship faster. We needed to ship better code. We needed to ship products that were stable, and the business was really demanding. And so we hit this precipice point where we had to make a decision. Do we continue trying to refactor this? Do we continue trying to solve the application's problems in the code base that exists? Or do we throw away the existing platform, build a new one, and rewrite it?

The way we approached this question, I think, is really, really interesting, and so I want to take a little tangent to go into this process a bit, because the answer we got to was quite unexpected. When we faced this question, we walked into a room. I say a room; I'm talking about maybe a two-meter by two-meter room with whiteboards on all four walls. We walked in — the head of product, the head of design, myself, and two other engineers — crammed into this tiny little room. And we got out our pens and started listing every core assumption of the product that we could think of. Now, when I say a core assumption, think of your users table. When you have a user, you probably have an email address associated with that user: a column, email, on the users table. You probably have a unique index on that email address. That's an assumption. You're assuming that an email uniquely identifies a user. If you wanted your application to change that assumption, if you wanted to say, I want two users to be able to have the same email address, how difficult would that be for you? Probably not terribly difficult.
Yeah, you'd have to change some of your authentication logic if you're using emails for authentication. You'd probably have to change a little bit around password resets and account management and things like that. But for the most part, that should be a pretty straightforward refactor. In our case, about 60% of the system relied on the fact that emails were unique. Some of the reporting pieces looked at email rather than user ID, and trying to tease that apart and refactor it to use user ID would have been incredibly, incredibly difficult.

So what we did was fill all of the whiteboards with everything we could think of that the product assumed. That took us probably about three hours. Then we took a different color marker, brought the rest of the product managers in, and said: this assumption here, that the email is unique within the system per user — do you want that to change within the next year? If they said yes, we drew a line through it. If they said no, we left it in the original color. If they said, well, maybe eventually, we left it alone. We did this process for every single one of them, and we were left with three assumptions that we did not want to change. Out of four walls' worth of assumptions, we basically wanted to change everything about the product without changing it.

So that led us to the decision to do neither. We didn't want to refactor, because trying to change that much about the program would be a nightmare. Even if it had been a really clean application, trying to refactor it that much would have been an insane challenge. But we also didn't want to rebuild, because rebuilding would mean rewriting the existing application. And so what we chose was to build a second product. At the highest of high levels, it did the same thing. It solved the same stupidly-high-level business problem. But at every layer below — the way it did it, the approach it took, how it solved it from the user perspective, how it was administered, how it was maintained, the technology, everything else — it was different. So we went to the board and pitched this: let's build a second product, let's invest as much as we can into it, and once it's far enough along, we can migrate our existing customers over and throw away the existing one. And so this is what we did.

Everything from this point forward is going to be me talking about that new product that we decided to build. And this is literally a greenfield project. We walked in with a team of about 23 at the start, about 35 towards the end of what I'm going to talk about here. And so we needed to start from scratch. How do you start from scratch? Well, there was this new thing — this is in the 2014 era, sorry, 2015 — a new thing called microservices, which was all the rage. We had experience building one of these microservices. So we sat down and said, what if we build the entire application off of microservices? What would that look like? And so we started drawing diagrams. We started building entity relationships. We started doing design sprints. We started doing event storming. And we came up with a plan for the UI, the architecture, and the application design. This is actually pretty close to one of the initial stabs at the architecture that we took, and to what we wound up finally building. At this point, actually, I should caveat: some details have been changed from the reality.
Some things I'm going to exaggerate a little bit for effect, and some things aren't going to be identical to what we did. But it is still, I think, good enough in spirit, and some of the things that have been tweaked are tweaked to get the point across better. So if anybody is watching from the company and from the team, I hope that the rest of this aligns.

So we have this architecture. I think the most interesting initial part was that we went API-first by design. We had a front-end server and a single-page front-end app — actually, a series of them. I think there were about three front-end apps for different usages, all based on React. So we had a front-end server that ran Node, that pre-rendered React and served it up to the client. It worked absolutely phenomenally, was really, really stable, everything worked great, so I'm not going to talk about the front end at all from this point forward. It was an amazing part of what we built. But this talk is about things that we did wrong, not things that we did right, so let's skip that.

The first thing I want to focus on here is this big green thing in the center, this thing called an API gateway. To put the time in perspective: the day we decided to go with this architecture was exactly one day before Amazon launched its AWS API Gateway to private beta. And our AWS representative reached out to us the day after we made the decision to go this route and said, hey, we have this new feature — would you like to be a private beta member? So we stopped and looked at it, and we had been researching other technologies, and we decided not to go with AWS API Gateway. But that should put the rest of the story into context, because we ran into that problem a lot: we made decisions because we had to make a decision, and then very shortly after, a better alternative came up that we probably should have taken.

So we went with Tyk. It's an open-source tool written in Golang. It uses MongoDB under the hood, which worked phenomenally well, but take that as you wish with Mongo — no offense to anybody. What Tyk is, is an API gateway. It terminates and separates external requests from internal requests. So in the public domain, anything outside of your firewall hits Tyk. Tyk handles OAuth for you. Your OAuth 2 termination, all of your secrets, all of your key management is done by Tyk. Nothing in your application needs to even know about OAuth; it's completely agnostic to it. It doesn't have to know about rate limits. It doesn't have to know about anything to do with the security of the request. Tyk terminates everything publicly and then simply adds headers that say: I validated this request through the other services, this user is user ID X, they have a valid token, here's the expiration time of the token in case you need it, et cetera, et cetera. And so your application can focus on the application and doesn't have to worry about these things. If you use Symfony or Zend or Laravel, this is very, very similar to the middleware pattern, applied at the services layer. This is a middleware that's only for public, external requests. Internally, every service used RPCs and used these custom headers to identify authentication information, et cetera, which made it really, really easy to iterate on the private services behind the scenes very, very fast.
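To make that concrete, here is a minimal sketch of what consuming those gateway-injected identity headers could look like on the service side, written as a modern PSR-15 middleware. The header names here are illustrative assumptions, not Tyk's actual defaults, and the talk doesn't show the team's real implementation:

```php
<?php
// Sketch of a PSR-15 middleware that trusts the gateway's identity headers.
// Header names (X-Authenticated-User-Id, X-Token-Expires) are illustrative.

use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\ServerRequestInterface;
use Psr\Http\Server\MiddlewareInterface;
use Psr\Http\Server\RequestHandlerInterface;

final class GatewayIdentityMiddleware implements MiddlewareInterface
{
    public function process(
        ServerRequestInterface $request,
        RequestHandlerInterface $handler
    ): ResponseInterface {
        // The gateway has already terminated OAuth, so the service only
        // reads the identity it forwarded -- no token validation here.
        $userId  = $request->getHeaderLine('X-Authenticated-User-Id');
        $expires = $request->getHeaderLine('X-Token-Expires');

        return $handler->handle(
            $request
                ->withAttribute('user_id', $userId)
                ->withAttribute('token_expires', $expires)
        );
    }
}
```

The point of the pattern is visible in what's missing: there is no OAuth library, no secret, and no rate-limit logic anywhere in the service.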
We were deploying these services in production, once we got everything up and running, probably about four or five times a day, and changing APIs potentially as often. Whereas because of Tyk, we were able to keep the public API, the public REST API, quite slow-moving and quite stable. Phenomenal design pattern; it worked really, really well for us. This is one of those types of decisions that I think we got really, really right early on. If I were building a system of this scale again, I would absolutely use an API gateway of some sort. We had fantastic luck with Tyk, and Tyk worked well, but I know there are plenty of other alternatives.

So that separates outside the firewall from inside the firewall, and note that the front-end server is outside the firewall. That means nothing about the front end depended upon anything private about the back end. We were developing against our own public APIs. So when we released our API to clients to build off of as a third-party API, we knew that it worked, because we had the front end that tested it. We knew it solved the business problems, because we had an active team of developers developing against it. And we knew it solved the use cases completely, because we had built our application off of it. So aside from some really, really narrow use cases, we were pretty confident in our public API.

The next layer is the backbone of this thing: the message bus, or RabbitMQ, as we chose. We used RabbitMQ as a pseudo-event-sourcing database, meaning every change to the application — so if you update your username — gets emitted as an event from the user service. That event goes onto RabbitMQ. Other services can listen to it and adjust their state accordingly, and so we had only one piece of shared state, which was RabbitMQ. I should note we chose RabbitMQ specifically because, at the time, it had the best bindings in PHP. Today, if I were to do this again, I would choose Kafka in a heartbeat, just because Kafka makes non-existent one of the biggest issues we ran into with Rabbit, which is the ordering of messages. Rabbit has the subtle property — and it's not subtle, it's actually spelled out in big, bold letters, but we just kind of ignored it for some reason — that you can't trust message order. And so we had this archive service that would take every single message that went onto RabbitMQ and put it into an archive. But when you have unordered messages, and you have an archive of unordered messages, and the messages are changes to the application, what happens if those messages get reversed? Yeah. That caused us a metric ton of pain. And the biggest reason for that was that the archive was not directly queryable by any of the applications. So we had this event-source database, if you will, in this archive, but no applications actually bound to the event store. If you've studied event sourcing, or if you've gone to any of the talks, that's one of the most critical aspects of keeping an event-sourced system consistent. We got that really wrong and paid quite a bit of a price for it.

But the thing it did allow us to do was keep every service completely separated. Every service had its own caching system, had its own database, and was really its own isolated pod, and it used Rabbit to communicate. And we communicated with two different types of messages. One is an event, which is one of those application changes that I talked about: the user updates their username, and a username-changed event comes out.
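A rough sketch of what emitting such a change event can look like with php-amqplib — the exchange name, routing key, and payload shape here are assumptions for illustration, not the team's actual schema:

```php
<?php
// Sketch: publish a "user.username_changed" event to a topic exchange.
// Exchange name and payload shape are illustrative assumptions.

require 'vendor/autoload.php';

use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

$connection = new AMQPStreamConnection('rabbitmq', 5672, 'guest', 'guest');
$channel    = $connection->channel();

// Topic exchange so services can subscribe to the event types they care about.
$channel->exchange_declare('domain-events', 'topic', false, true, false);

$payload = json_encode([
    'event'       => 'user.username_changed',
    'occurred_at' => date(DATE_ATOM),
    'data'        => ['user_id' => 42, 'username' => 'new-name'],
]);

$channel->basic_publish(
    new AMQPMessage($payload, [
        'content_type'  => 'application/json',
        'delivery_mode' => 2, // persistent
    ]),
    'domain-events',
    'user.username_changed' // routing key
);

// Note: RabbitMQ does not guarantee ordering across queues and consumers --
// exactly the property the archive service above tripped over.

$channel->close();
$connection->close();
```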
And then there are RPCs. We did quite a bit of work with video and video transcoding, and one of the common operations was: a user uploads a video; we need to transcode it into a bunch of different formats; we need to extract closed captions, et cetera. All of those processes take a lot of time. So you would send off an RPC request over RabbitMQ to say, hey, do the transcoding, and then at some point in the future you get a message back over RabbitMQ saying the transcoding is done, do the next thing. That RPC system was actually quite reliable and was a phenomenal use of the platform.

So the final layer that we'll spend a good bit of time talking about today is the service layer. The names here — domain service, async service, and meta service — are terms we came up with to try to distinguish the ways applications behaved. We had an internal platform that we used. I wouldn't call it a framework; I would call it a set of Composer packages that brought in and wired together applications differently for each one of these service types. So, for example, async services need to talk to RabbitMQ a lot, and so for the PHP services built on this we had a base package that listened to Rabbit and pulled that together. They weren't really frameworks, although some of them did use open-source frameworks. The difference between these types is the purpose they serve inside of the platform. The domain service is named domain because it's a domain entity. If you've looked at domain-driven design, you look at the different entities that you have — we basically took every domain entity and stuck it in its own service, give or take. They all communicate over HTTP, and they have their own persistence. I mentioned we do a lot of batch jobs: async services. They only communicate over RPC. The only HTTP endpoint they open up is a health check. And then the meta services fill the gaps between the two, and I'll give you a concrete example of what that looks like.

Here we have a version of the domain that we were working with. You have users, you have assignments and a history object, and then lessons with content and with assets. Assets would be video files or images or anything like that. Content has many assets, a lesson has many pieces of content, an assignment has many lessons, and then we keep track of which lessons the user sees. Not a terribly complex model. The reality was a little bit different, but what we did was build a service around each one of these primary entities. So we turned it into a user service, an assignment service, a history service, a lesson service, a content service, and an asset service. That's a lot of services. And especially when you extend that to the rest of the application and the rest of the design, we had probably about 50 services within a year, and the number was growing quite quickly.

So the question comes: let's say you want to show an assignment on the front end. The assignment, remember, has a set of users that it's associated with, has a set of lessons, et cetera. On the front end, in an administrative portal, how would you render that? Well, the naive way would be to call for the assignment, then loop over each one of the other objects and call for them, and have, you know, a couple hundred thousand HTTP requests from the front end in order to render the page. We decided not to do that. Rightfully so.
We also decided not to go with something like GraphQL to do that for us, for two reasons. One, GraphQL would basically just shift the problem to inside of our firewall rather than outside of it. And more importantly, the assignment concept is a concept that cuts across all of this. So when we were shipping a public API, I wanted to expose that concept in that API. We wanted an API that actually returned the domain information for all of this. And so what we built was a meta service. The meta service knows how to talk to all of these services, but it's also listening for changes on RabbitMQ. So anytime you update one of these things, it updates its local state. Requests from the front end only have to hit this one service, which 99.9% of the time just serves things straight out of its cache. It wound up speeding up the application drastically, and it wound up making it really, really simple to work with from an API layer.

The problem comes when we change the question. What if instead you want to get a list of lessons ordered by content author name? Naively, you would get every single user, get every single lesson, do the join, and then order it. We could create a lesson-sorting meta service, but that would be really weird. We could do a whole bunch of other things, but every single solution is dirty and weird. The actual solution we eventually deployed was Elasticsearch. We built a search service that would do these types of highly aggregated queries for you.

So that's the architecture of what we designed and started to build. Let's talk about the infrastructure for a bit. At the time, there was a project that had just been announced but was not actually released yet, called Kubernetes. And so we actually couldn't run Kubernetes, because there was no Kubernetes to run. We knew we wanted to do containers, primarily because we had Go, we had PHP, we had Scala, we had Spark, we had Hadoop, all running in this infrastructure. So we wanted some kind of a job worker to manage all of this. I mentioned Spark. We had a production Spark infrastructure on the prior application, so we had a good bit of experience running a product called Apache Mesos. Apache Mesos is like Kubernetes, but for abstract jobs rather than containers — it's a layer down. Mesos runs on all of the different servers in hybrid environments, so you can give it servers in multiple different data centers, et cetera, and then you give it jobs, and it figures out how to run those jobs and handles that for you. And there's another piece of open-source software called Marathon, which basically binds Mesos to Docker. So we ran Marathon on top of Mesos, and we could have one cluster that ran all of our data jobs as well as all of our HTTP jobs. What we wound up having was, I think, about 40 to 50 servers that we could deploy containers to, and basically we just had an API. When I wanted to deploy a new service, I could just call an API and the new container was spun up. I told it I wanted four instances, and Mesos figured out the rest. We coded in rules that if there are multiple instances of a service, they must span data centers — because, again, Mesos is smart: it knows where the servers are, and it knows how to orchestrate everything.
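For a flavor of what asking Marathon for "four instances, spread across data centers" looks like, here is a sketch of an app definition of the kind you would POST to Marathon's /v2/apps endpoint. The app id, image, and the `datacenter` attribute are made-up examples, not the team's actual configuration:

```json
{
  "id": "/services/lesson-service",
  "instances": 4,
  "cpus": 0.5,
  "mem": 512,
  "container": {
    "type": "DOCKER",
    "docker": { "image": "registry.internal/lesson-service:1.4.2" }
  },
  "constraints": [["datacenter", "GROUP_BY"]],
  "healthChecks": [{ "protocol": "HTTP", "path": "/health" }]
}
```

The `GROUP_BY` constraint is what encodes the "instances must span data centers" rule: Marathon spreads instances across the distinct values of that node attribute.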
And so this is what we had at the network layer. We've got a whole bunch of servers that Mesos is running, with Marathon orchestrating the containers, and then there's how we route the network. We used ELBs, because Amazon ELBs are insanely reliable, but they're also slow to update. So we also used HAProxy, which could update in the span of milliseconds. And so we built this tiered system to be as reliable yet as performant as possible. A request would come in. It would hit the ELB. The ELB would call HAProxy. HAProxy would then call Tyk, which is running in a container just like any other service. And then Tyk would do OAuth, it would do rate limiting, it would do all that fun stuff, and it would figure out: this is a call for an internal service. So it calls down to that domain name, which then hits that load balancer, which then hits that HAProxy instance, which then calls up into the service. It adds maybe 10 milliseconds of latency, and yet we tried to kill it. We literally went in and killed Mesos nodes, we killed HAProxy nodes while load-testing it, and it was surprisingly robust. Internal requests work the same way, except they skip a step. So if service A wants to call a different service, it just calls that service, goes through the ELB, and comes back into the service. What this means is that if we want statistics on what's going on inside of our network, we only need to look at the load balancers. The load balancers will tell us about every piece of communication, because no service can talk directly to another service. So we knew what was going on in the stack.

Speaking of knowing what's going on in the stack, let's talk about monitoring and logging. We have a Mesos host that's running a series of containers. Well, we're running Docker, and we want to collect the logs, so we put Logspout on each host, and Logspout pushed off to a Kafka queue, which then pushed off into Datadog and S3. Logs were available instantly for every single service, and you didn't have to do a thing about it. It was all automated for you. Even which service it came from was automated: we had the Docker container information and the host information, so you could correlate this. We also wanted to collect metrics, so we deployed a StatsD collector on every host. Every container talks to a local deploy of StatsD, and so we got our metrics from our application. And finally, we wanted some tracing information. What that means is, if one service calls another service, we want to know that, and we want to be able to see and debug: this request failed, what was the original request, and be able to trace backwards; or, this service is slow, so let's see why it was slow, what was slow in the query path. So we used something called Zipkin. It was, at the time, really difficult to work with, because it was right in the transition period before a major fork, and so the open-source version of Zipkin was difficult to run and wasn't really stable yet, but the closed-source but available package was also really difficult to run. So we spent a lot of time trying to get Zipkin up and running, and towards the end we actually did get it up and running, and it was working out quite well. But that's something I wish we would have done quite a bit earlier.

The final piece of the infrastructure puzzle is the service.json config file. I mentioned we had about 50 services.
Every service has its own GitHub repository, and in the root of that GitHub repository we defined a service.json file. It defined what the name of the service was; what other services it depended upon, so we knew what to spin up; and what types of data stores it needed. Note there's no authentication information, because all of that was generated at deployment. We would automatically know that you needed a Postgres database, so we'd spin up a Postgres instance, create a database, a user, et cetera, and set all the environment variables — 100% automated, so humans didn't actually even have access to that information. You'd define your health checks, set the service type, the public endpoints that you had, and the deployment variables. Towards the end, the goal was that if you wanted to deploy a new service, all you had to do was create this file and then tell our central deployment system about that GitHub repo, and it did the rest. It deployed, it spun up all the instances; every time you pushed to master, it knew that it needed to rebuild a container; it handled CI for you. It really did do everything.
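Pulling those pieces together, a hypothetical service.json along the lines just described might have looked like this — the field names are illustrative, not the actual schema, but note what's covered (name, dependencies, data stores, health check, type, public endpoints, deployment variables) and what's deliberately absent (credentials):

```json
{
  "name": "lesson-service",
  "type": "domain",
  "depends_on": ["user-service", "content-service"],
  "data_stores": ["postgres", "redis"],
  "health_check": "/health",
  "public_endpoints": ["/v1/lessons", "/v1/lessons/{id}"],
  "deployment": { "instances": 4, "memory_mb": 512 }
}
```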
So with all of this said — this entire infrastructure section — does any of this look familiar? I mentioned at the start there was this new project called Kubernetes that came out right as we were starting this program, but wasn't actually open source yet; it was just announced. This is basically Kubernetes. We built a large part of Kubernetes, and a large part of the architecture decisions we made — by luck, by the knowledge of the team, et cetera — wound up being almost identical to the way Kubernetes actually runs. And so all of this stuff that I just showed you, that we designed, built, and worked on, we could have thrown away and just replaced with one Kubernetes spin-up today. At that point in time, again, we were just a little bit too early to use Kubernetes, but today: just use Kubernetes and forget about all that. It doesn't mean you don't need an ops team. It doesn't mean you don't need automation. It just means the infrastructure is handled for you, and you need to maintain that piece.

So let's talk about local dev. I mentioned we kicked off this whole process with a three-month window in which we were looking to build the minimum viable product. So we built the first, I think it was about 10, services, built the infrastructure, got everything up and running, and spent the last month of those three months trying to get it running in production. And what we realized was: that's going to be a nightmare. We knew it was going to be a nightmare. So let's try to make a local dev experience that mimics what we're expecting in production. What we intended, and what we actually did build, was a command line tool that you would run. You would check out a repository and run this command line tool. It would look for the service definition and figure out: well, I need to spin up a database, I need to spin up the dependent services, et cetera. It basically generated a Docker Compose file, ran docker-compose up, and spun up everything locally. It was all designed to work frictionlessly. It was designed so that every time you spun up these services, it would run your migrations for you. So when other services that you depended upon were moving quickly, they stayed up to date. Their databases stayed up to date. The data in them stayed up to date, et cetera. And it was all designed to wire everything together the same way we wanted to wire together prod.

And what actually happened was: that was not used. Every developer ran their own service natively. They ran Apache or Nginx on their local machine, PHP-FPM, and built the service. If they needed to talk to another service, nine times out of ten they just kind of faked it: they made their own little endpoints that they could call, or mocked out that portion of the application so that they didn't call the other services. They never ran the tool. Sometimes they would actually spin up the service they depended upon on their local machine; they would talk to the dev of that other service: help me get this thing up and running. And what happened was, those got stale. Service A depended upon service B, so I get it up and running on my machine today. Next week, when I go to ship my service, I'm depending on an old version, so when I try to deploy, I need to go fix all of those inconsistencies between the two. It was a nightmare. It was an absolute nightmare.

So we stopped and thought about: why? Why is no one using the local dev system? And it turns out there were a couple of key reasons. It was excruciatingly slow. Because it had to configure sometimes five to ten other containers and services and build them all, including database containers and everything, the spin-up time was about 15 minutes. From the time that you said, give me a new dev environment, to the time that your new dev environment was ready to work: about 15 minutes. My code's compiling, yay. It was difficult to use. It frequently broke, or frequently got into inconsistent states where you had to manually kill it all and rebuild it. It was just really unreliable. But that's not why it wasn't used. It wasn't used because it was someone else's problem. A person on the team was picked to build the local dev experience and to maintain that local dev experience, and they did that. But all the rest of the devs were told to just use this thing, and were never given any ownership over it. So they never had any investment in fixing any of the problems. When something went wrong: well, I've got to get my work done, so hey, you, go fix the problem; I'm going to go work over here locally. And so it never got adoption, because it was always someone else's problem. That, I think, is the biggest mistake we made on the infrastructure and ops side: not making local dev every single developer's problem. Making it someone else's tool that they maintained really, really bit us hard.

And so trying to get this thing up and running was quite a bit of fun and quite a bit of a challenge. I mentioned we got the first set of services up in two months out of that three-month window. Getting the first production request took another month. Think about that. We were code complete, everything worked locally, and getting it up and running in a production infrastructure took 30 days. Ouch. It failed for a whole bunch of reasons, some of which are probably textbook — you're looking at them going: yep, been there, been there. Some were due to the failures of local dev and inconsistent local dev experiences. And some of it was just bad management on my part and on the part of my team, where we built services to idealized behavior, and we didn't actually have real metal to try this stuff on to learn how it worked. So there were some learning kinks that we had to work out.
I mentioned services were built in isolation, but the biggest problem was that we got to code complete at month two and still had more product to build. So all of the product engineers were off building new services while infra was finishing and finalizing the infrastructure, and by the time we ran into problems, devs had been off those services for a week or two. And that context switching — back to, oh, I've got to fix this bug; nope, I've got to fix that — cost us a ludicrous amount of time. And the APIs were evolving rapidly, which led to all sorts of problems trying to maintain things. I'm not going to read every one of these; this is kind of a dump of some of the big takeaways from getting it up and running in prod. Take a look at the slides afterwards, or take a picture here.

The one I want to really call out is this guy: RabbitMQ events were missing data or structure. I mentioned we used RabbitMQ as an event source. Well, what happens if your event source — all of the changes that the entire application relies upon — is missing data? That's a pretty critical issue. The problem was that our services were all tested incredibly well. We had, I like to say, over 100% coverage: we had 100% line coverage, but we also had very well-reviewed unit tests for every single service. The services were insanely well tested from a unit test perspective. We tested that when an HTTP request comes in, it triggers the change that we expected in the model layer. We tested that when the model layer changes correctly, we get the change persisted to the database correctly. We knew that when the model layer changes, we get an event emitted to the event store correctly, with the correct information. All of that was tested. What we didn't test was all of that put together. We had no smoke test verifying that when an HTTP request came in, it did all of those things together. And what happened was we had a wiring error, where in a couple of services the wiring from the request coming in to the event going out was ever so slightly misconfigured. The events were still going out; they just had no data. So user changes were emitting events that were empty. How did we find this out? I would love to sit here and say: well, we had instrumentation, we had a service that identified this, and our error reporting was great. We found out because we went to replay the events to build up a new service, started getting errors building that new service after about a month of uptime, and went: oh, crud. This is not good. We had lost production data. We did manage to resolve that, and we did rebuild, but it cost us a pretty significant amount of time. The lesson is: always smoke test in addition to unit testing. I'm not saying smoke test every path, but make sure that the application is wired together correctly. Unit tests are not enough, especially when doing something as complicated as a distributed system.
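A minimal sketch of the kind of smoke test that would have caught the empty-event bug: drive the running service over real HTTP and assert that a fully populated event actually comes out the other side. The URLs, exchange, and payload fields are illustrative assumptions, not the team's actual setup:

```php
<?php
// Sketch of an end-to-end smoke test: real HTTP request in, real event out.
// Service URL, exchange name, and payload fields are illustrative.

use GuzzleHttp\Client;
use PHPUnit\Framework\TestCase;
use PhpAmqpLib\Connection\AMQPStreamConnection;

final class UserUpdateSmokeTest extends TestCase
{
    public function testUpdatingAUserEmitsAPopulatedEvent(): void
    {
        $amqp    = new AMQPStreamConnection('rabbitmq', 5672, 'guest', 'guest');
        $channel = $amqp->channel();

        // Bind a throwaway exclusive queue to the events exchange first.
        [$queue] = $channel->queue_declare('', false, false, true, true);
        $channel->queue_bind($queue, 'domain-events', 'user.username_changed');

        // Exercise the real wiring: an HTTP request against the live service.
        $client = new Client(['base_uri' => 'http://user-service']);
        $client->patch('/v1/users/42', ['json' => ['username' => 'smoke-test']]);

        // Poll briefly, since the event is emitted asynchronously.
        $message = null;
        for ($i = 0; $i < 10 && $message === null; $i++) {
            usleep(100000);
            $message = $channel->basic_get($queue);
        }

        // Assert the event actually carries data, not an empty shell.
        $this->assertNotNull($message);
        $event = json_decode($message->getBody(), true);
        $this->assertSame('smoke-test', $event['data']['username']);
    }
}
```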
The final major point that I want to cover is dealing with change. So we have this thing out in prod, we're serving customer production traffic, and product changes come in. It could be because we've got client feedback that says the thing we built wasn't exactly what was needed. That happens, that's good, we should respond to that. It was partially executive stakeholders coming in saying, I want X, or, I think we need to do Y — which is fine and natural; I don't think any of those were egregious, it's just the natural course of progress. And sometimes we just didn't get it right the first time and needed to iterate a couple of times. All 100% natural. Change is natural to software.

Let's take an example of a real change. Let's say this is a hierarchy of data, and let's say each one of these is a service. It wasn't in our case, but let's go with it for example purposes. A program has many topics, each topic has many lessons, each lesson has many cards, and each card has many assets. What happens if product asks you to add a new layer to this? And when I say a new layer, I'm not talking about categories that are just a taxonomy — these are each domain objects with actual business rules built against them. How would you refactor that in a monolith? That's actually probably pretty easy: you just add the new logic, do a migration, and you're done. What about removing one? That's probably even easier in a monolithic application context. But when you talk about services, how would you refactor this? Well, it's not easy. The refactoring was actually one of the most difficult parts, because we had to do three-stage — well, actually four-stage — refactors. Stage one was figuring out what the change was that we wanted to make: we want to change these four APIs and create this fifth API. Then we needed to go to every service that depended upon the ones we wanted to change, make it so that each of those services could accept either version — the old or the new — and deploy all of those and make sure they were up and running stably. Then deploy the new version of the target service. And then finally go back and clean up, removing all of the conditional logic that allowed the old format. What would have been a simple one-day change — an ETL step in your application and a migration — became major, major coordinated surgery that required almost the entire team. That, I think, was the biggest failure of the architecture: it made it really easy to respond to change at the micro level, and it made it nearly impossible to respond to change at the macro level, at the larger level.
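The "accept either version" stage in code is essentially a tolerant reader. Here's a sketch of what a consuming service might do during the migration window, assuming a hypothetical new "units" layer inserted between topics and lessons — the field names are made up for illustration:

```php
<?php
// Sketch of the dual-version stage: the consumer accepts both the old
// payload shape (lessons directly on the topic) and the new shape
// (lessons nested under a hypothetical "units" layer).

function extractLessons(array $topicPayload): array
{
    // New format: lessons grouped under an intermediate "units" layer.
    if (isset($topicPayload['units'])) {
        $lessons = [];
        foreach ($topicPayload['units'] as $unit) {
            $lessons = array_merge($lessons, $unit['lessons']);
        }
        return $lessons;
    }

    // Old format: lessons hang directly off the topic.
    return $topicPayload['lessons'] ?? [];
}
```

The pain the talk describes is that this branch has to be added to, deployed in, and later removed from every dependent service, in a coordinated order.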
So, wrapping up, I'm going to summarize a couple of the key takeaways. First off, I mentioned that we designed our services really small, at the entity level. We didn't appreciate how unreliable service calls are. I mean, you say it out loud and, yeah, we know HTTP is unreliable — but how unreliable is it? Let me ask a question: how often do you expect a method call to fail? I'm not asking about what the method does; I'm asking about the method call itself failing to execute the code in that method. One in maybe a billion method calls? No, they're more reliable than that. One in whatever that large number is. Incidentally, that large number actually does have a root. If you take a real-world Symfony application — not a hello-world one — and look at the number of methods it executes in one request, then multiply that out against 99.999% uptime, you wind up with roughly that number. So if the only reason a Symfony application ever failed were a method call itself failing — not an HTTP request or a database issue or anything like that — that's the sort of per-method reliability you'd be looking at. In practice, though, you don't think about it. We don't think about it.

When we do object-oriented design, when you talk about SOLID, when you talk about design patterns — every single one of those things, with the exception of something like the proxy pattern, is based on the fundamental assumption that method calls are 100% reliable. I guarantee that none of you, when writing code, ever thinks about: well, what if this method call doesn't actually enter the code? Yet with a service call, that is absolutely going to happen, over and over and over again. Even if you're able to do five nines of uptime, that's one out of every 100,000 requests that is going to fail. And so the biggest error that we made was in how big a service should be. We went for the domain size, because we were really good at domain design, and we were really good at object-oriented design, and we treated microservices and service-oriented architecture like objects. And the reality is, that was a horrible idea. What I would do instead — and what we actually refactored to at the very, very end, after I had left — was this: have a couple of services where the domain boundaries are very, very narrow and very, very well defined. And that's it. The lesson service handles everything about lessons. One of the guys on our team coined the term "microlith" to describe this. Start your services as microliths, as large as possible, because it is far easier to split apart a large system than it is to stitch two dynamic systems together, especially when there are a lot of dependencies and a lot of network traversal between them. Build it large, and then when you need to split or refactor something out, do that. It's far easier.

My number one recommendation: do not do microservices. In reality, there is a caveat here. I'm not suggesting microservices are a bad pattern — obviously there are lots of very large and very successful teams who do it very, very well — but do not do it, period, unless you have a dedicated team for tooling and infrastructure. And I'm not just talking about production tooling; I am talking about developer tooling. If it is not easy for your build systems to work, for you to run tests and testing infrastructure, and to do your own local dev in a 100% automated fashion, you will have a bad time at scale with this type of complex system. I mentioned: start with big services. It's easier to split a large service than it is to stitch together two. Automate everything. Deployment is the obvious one, but I'm talking far lower-level than that: your spin-up, getting a new service into your infrastructure. Automate your deployment; automate your migrations. Are you using a migration tool that's dependent upon your service, or are you using one tool for all of your services? Automate your backups. State restoration is a huge one. If you want to reproduce a production state locally without downloading production data, how do you do that? How can you set up all of the objects in all of the services in the right state? It's actually a pretty difficult problem.

Don't plan for failure. Live it. This is the biggest takeaway. I said don't do microservices, but I still use serverless architectures, I still use services, and I still believe in using services. The biggest change in how I approach writing that code is that I write the failure case first. When I have a service that's going to call a second service, the first thing I write, test, and deploy is what happens if that second service is missing. How am I going to recover? How am I going to gracefully handle that issue? Get that in, test it, make sure that's solid, and then handle the happy path. Because if you focus on the happy path first, you're going to code yourself into a corner, and you're going to have a bad time at some point.
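A sketch of what "write the failure case first" can look like in practice — the service URL, timeout, and fallback behavior are illustrative assumptions, not a prescribed implementation:

```php
<?php
// Sketch: the failure path is written (and tested) before the happy path.
// URL, timeout, and fallback value are illustrative assumptions.

use GuzzleHttp\Client;
use GuzzleHttp\Exception\GuzzleException;

function fetchUserDisplayName(Client $http, int $userId): string
{
    try {
        $response = $http->get("/v1/users/{$userId}", ['timeout' => 0.5]);
    } catch (GuzzleException $e) {
        // Failure case first: the user service is down, slow, or unreachable.
        // Degrade gracefully instead of cascading the failure upward.
        // (Log and alert here; a stale cache read would be another option.)
        return 'Unknown user';
    }

    $body = json_decode((string) $response->getBody(), true);

    return $body['display_name'] ?? 'Unknown user';
}
```

The design choice worth noticing is the aggressive timeout: at one failure in every hundred thousand calls, a hung dependency is a certainty, so the caller decides up front how long it is willing to wait and what it will show instead.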
The final takeaway, which I didn't really touch on too much here, is around service level objectives. If you haven't read the Google SRE book, I would highly, highly recommend it. It's a phenomenal book. One of the big things it focuses on is business service level objectives. What I mean by that is: for every service, we tracked a whole bunch of metrics. We tracked time to first byte. We tracked average response time. We tracked requests per second, et cetera. But the business doesn't care about requests per second to your users table. The business cares about the number of successful authentications. It cares about the ratio of authentication failures. It cares about the number of lessons you're actually able to deliver. It cares about all of these other metrics. The technical metrics are important — I'm not saying don't measure those — but think about what the business case for the service is. Why does the business care that the service exists? Define those objectives and get them into code. Get them into code early. Get them into your monitoring systems. And be sure that your service is operating to what the business needs of it, not just to the geeky technical needs.
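Getting a business objective "into code" can be as small as a counter. Here's a sketch that emits business-level metrics to the per-host StatsD collector described earlier, speaking the plain StatsD UDP wire format so it needs no client library; the metric names are illustrative:

```php
<?php
// Sketch: emit business-level counters to the local StatsD collector
// (one per host, as described above). Metric names are illustrative.

function statsdIncrement(string $metric): void
{
    // StatsD wire format: "<name>:<value>|c" over UDP, fire-and-forget.
    $socket = @fsockopen('udp://127.0.0.1', 8125, $errno, $errstr, 0.1);
    if ($socket === false) {
        return; // metrics must never take the request down
    }
    fwrite($socket, "{$metric}:1|c");
    fclose($socket);
}

// Track what the business cares about, not just time-to-first-byte:
statsdIncrement('auth.success');         // successful authentications
statsdIncrement('lessons.delivered');    // lessons actually delivered
// statsdIncrement('auth.failure');      // the other half of the failure ratio
```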
The biggest takeaway that I've got from this entire experience is about what we do as software engineers, software developers, programmers. On one level, it's writing the code to solve a problem. But when you get beyond that, I think our number one job is managing complexity. We can write the complexity into the source code in front of us — the 1998 way of doing things, where you just have a single-page spaghetti app. The complexity of the system is really, really small, but the complexity of what you're looking at is quite high. You can refactor and use more object-oriented design, where the object you're looking at is quite simple, but at the expense of more complicated code elsewhere. You can use modern frameworks — or, there was a controversial post about visual debt a couple of years ago, where you just remove all the type hints from your file. It makes it look simpler. And it does: it makes what you're looking at simpler, and it makes that complexity go somewhere else in the application. And that's, I think, the biggest thing: you can never eliminate complexity. You can only move it around. And so when you're dealing with change, when you're dealing with systems and applications and software, be cognizant of where the complexity you're introducing goes and what trade-offs you're exercising with respect to that complexity. I had a really good conversation last night with a couple of people about the difference between writing software for a library that has 3 million active users and writing software for a team of 12 to maintain. Fundamentally, it's the same thing — we're writing software in both cases — but the complexity trade-offs are vastly different. You can deal with, and get away with, a lot more when you're dealing with a small team. So you have to understand your audience and the context within which you're running. And the final takeaway: it is far, far, far too easy — we got bit by this; we've got the battle scars for it — to create a system that is so complicated that you cannot understand it.

And if you can't understand it, how can you possibly hope to run it in production, to debug it, to maintain it, to keep it alive, and to actually work with it quickly? Thank you. We have a few minutes for questions, if there are any questions in the room.

Did you ever end up refactoring to take on some of those technologies you kind of missed going out the door?

So, did we wind up refactoring toward those things — things like Kafka, things like Kubernetes? Not to my knowledge. Not by the time I had left the team — this is not the team I am with right now; this is a prior organization. By the time I had left, no. I've kept in touch with a number of them, and I do not believe they have refactored in those directions. But when you have a system that's up and running, the trade-offs of moving onto those other systems change: the infrastructure is already built. So the benefit of moving onto Kubernetes would have been relatively low, given that all the tooling was built and it was automated. But yeah — some of it we did, but most of it, no. Another one back there? Microphone's coming; it'll be there in a second.

So, about the problem you mentioned, when you had to go over all of your services and support both versions of the message simultaneously. What do you think of this idea — I think they call it a message translator or something, from the integration patterns? You put version support into your messages from the very start, so if your current service does not support a version, it just declines the request and doesn't process it. And when you refactor your messages, you put up an extra service that duplicates the messages and translates the new version into the old one, and then you refactor all of your services one by one and remove the message translator after that. Does that work?

Yeah. So, for the message layer — for the event bus, for RabbitMQ — we actually had a solution to that problem, where every service was required to handle all of the versions of every event, because we had those messages stored. So instead of trying to upgrade in real time, on the fly, and only offer the current version of the event, we actually mandated support for every version of every event, and that was tested. So we didn't run into that problem on the event layer; where we ran into the problem was on the HTTP layer, with HTTP service calls. And could we have implemented another service to translate backwards? Absolutely. That actually probably would have been a halfway decent solution to that problem. But we focused more on the nimble side of trying to keep the application up to date. And with the type of changes that I'm talking about, most of them were more business logic changes than just structural changes. So having a single proxy turn A into B would work for structural changes, but I don't know, off the cuff, that it would necessarily work for the level of business changes that we had. I think we have another?

Hi there. So you've given us a bit of a health warning on microservices, and that's much appreciated. What would your take be, then, on using that architectural pattern of a message bus to glue existing services together, and to also be the glue for new services — say, for a CRM to talk to the message bus instead of directly to all these services it has been talking to? Have you got any steer and tips on that?

Biggest tip: what is your single source of truth for the data? Is your single source of truth an API?
Is it that message bus itself? Or is it something else? Have a single source of truth for every piece of data. It doesn't have to be the same one for everything, but every piece of data should have exactly one source of truth. Make sure that that source of truth is robust, and anytime you need truth, go to that source. So if you're using a message bus to communicate changes, that's perfectly fine — I'm not saying RabbitMQ should not exist or anything like that for an application layer; it absolutely should. But how do you handle a missed message? Do you have a mechanism for going back and reconstructing from a source of truth that's tested and that's robust? That, I think, is the biggest piece I would look at. Again, it's handling the failure case. And if your source of truth is that bus, then handling that failure case could just be replaying from the bus — and then you're relying on the bus to deliver your messages reliably.