Let's talk about how to replace a jet engine of your system in flight. Today, we'll talk about zero downtime migrations and how we can make that happen. I'm Aysylu Greenberg. I work on Google's infrastructure. My current project is Google Drive infrastructure, but today we'll talk about a project I worked on as part of the developer infrastructure: the distributed build system. We'll talk about how we did a zero downtime migration for that.

So let's take a brief walk through the history of build systems, and how one build system can become a distributed build system in a company. First of all, let's do a show of hands: how many of you work with Make, Maven, Gradle, or Rake? OK, I see a few hands. OK, yeah, more people. These build systems allow you to execute a build: they automate compilation and the process of creating binary artifacts. And they do this on a single machine, on a single desktop. The system I'll talk about today is a distributed build system. Your company may not have the business need for a distributed build system, but generally, every build system goes through a similar evolution.

We start out with a project that one engineer is working on, which they need to build, so they use the build system on their machine. The build system detects the many dependencies, some of which could be stored somewhere in the cloud. Say our project depends on TensorFlow: the build system is able to download it.

As the project gets more and more popular and a whole team is assigned to add more and more features to it, the team has to set aside a machine and use the build system on that separate machine to execute the builds: build their binaries, test them, and deploy to production. Now, a company may have many different teams, and if it's a machine learning company, then maybe every single team has a dependency on a machine learning library like TensorFlow. So now every team has to set aside its own machine to execute the builds: run the tests, build the binary artifacts, and deploy them. As the teams and the company grow, and many different teams have to do this, there's overhead: each team needs to allocate time, energy, and resources to maintain that separate machine on which they build and test, keep it up to date and running, and fix any bugs. This takes time away from their primary project, which in this example would be their machine learning project.

This is where a team like my distributed build system team comes into play. Instead of having every team set aside its own machine and worry about maintaining it, we create a shared infrastructure. Our team maintains the shared infrastructure and all of its machines, ensures that the user interface and user experience are the best possible, adds new features, and maintains availability for the different teams in the company.

The system I'm talking about today, the one you saw in the boxes, is called BuildRabbit at Google. It's a distributed build system, and it sits right in the middle of the build and test stack.
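To make the TensorFlow dependency example concrete: in Blaze/Bazel terms (Bazel comes up again in a moment), a build target declares its dependencies in a BUILD file. This is only a minimal sketch; the target name and the pip repository label are illustrative, not from the talk:

```python
# BUILD file sketch (Bazel/Blaze rules are written in Starlark, a Python dialect).
load("@rules_python//python:defs.bzl", "py_binary")

py_binary(
    name = "trainer",                # hypothetical binary this team ships
    srcs = ["trainer.py"],
    deps = ["@pypi//tensorflow"],    # external dependency the build system fetches
)
```

Given a dependency graph like this, the build system knows what to download, what to compile, and in what order.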
So the continuous integration system sends its build and test requests to BuildRabbit, which executes the builds and runs the tests on remote machines. Some of you might be using Jenkins; this is similar to what that system does. The release infrastructure uses BuildRabbit to build artifacts and then deploy them to production. Individual Google engineers and other teams use BuildRabbit directly if none of the existing infrastructure suits their needs. And the integration testing infrastructure uses BuildRabbit to build all the individual services, bring them up, and run integration tests to ensure that the system as a whole works correctly.

Now, BuildRabbit itself depends on many different components of the build and test stack. First of all, you need to figure out what the source code state is for a given user and a given commit number, so the source system gives us an overlay of the source code state. Then BuildRabbit goes into Blaze, the single-desktop build system similar to Maven and Gradle, which executes the build and puts the build artifacts into storage. BuildRabbit then has a way to retrieve those artifacts and serve them back to the user. Blaze was open sourced last year as Bazel, so if you're interested, you're welcome to check it out under the name Bazel.

So as we see here, BuildRabbit sits right in the middle of the build and test stack. And because of the number of people relying on it and the volume of requests coming in, the luxury of downtime is not an option for us. We can't just ask our users to stop building and testing and stop releasing their code. Turning our system down, swapping in a new architecture, and bringing it back up is just not possible. We need to do zero downtime migrations.

To give you a better sense of the scale I'm talking about: 30,000 engineers in over 40 offices use the distributed build system. There are 45,000 commits going through the system every day: 15,000 commits made by human engineers, and 30,000 made by robots, that is, automated scripts. Anything that would be a repetitive task for the developer, we automate away, and those automated changes still go through our build system. The source code is two billion lines of code, and half of that code base changes daily. Every time a change to the code base is issued in a commit, we rerun all the tests to ensure that all the integrations still work and everything behaves as expected. There are roughly, on the order of, five million build and test requests going through the distributed build system, BuildRabbit, and it produces petabytes of output artifacts.

When the project started out, it was quite small, an experimental project. We didn't know exactly what its niche would be, or exactly how our users would use it. But as the system became popular and finally found its use, and a lot more engineers and teams started depending on it, we started hitting the limits of our architecture. We started out with something very beautiful, like a pagoda: a very beautiful architecture. It made a lot of sense for many, many years, and it worked really well for us.
We were able to have this project up and running and make it very useful to the company. But as more and more users started using it, and more people found it reliable enough for their use cases, we started hitting its limitations: it no longer supported the metropolis-scale demand. We needed to re-architect it, to build it to support the existing demands of thousands of engineers as well as the future growth of the build system. For the remainder of the talk, I'll cover how we did that architectural upgrade, and how we did it with zero downtime.

Again, as you can see here, the BuildRabbit system sits in the middle of the build and test stack. It has a lot of dependencies on the components of the build and test stack, on the source system and the build system itself, and it's used by many, many different infrastructure projects and individual Google engineers and teams. So this task really felt like we needed to replace a jet engine of our system in flight. I tried to find pictures that would demonstrate what the task was about. Apparently you can refuel an airplane in flight: what you see here is a helicopter refueling an airplane. But apparently we cannot, as a civilization, yet replace a jet engine of an airplane as it moves around. So I had to fall back on my drawing skills, and this is what it felt like trying to launch this with zero downtime. We needed to keep the system up and running, keep the airplane in flight, and replace all of its inner workings so that it became a much more powerful system.

So let's talk about how we did the migration with zero downtime. For the remainder of the talk, I'll continue using the same distributed build system example, but the takeaways are the same for your system if and when you need to do zero downtime migrations, because the lessons learned and the main considerations are the same regardless of the system.

To make sure we're on the same page about the task at hand, let's look at the old architecture and the new architecture, and where the difficulties lie. We started out with a very simple client-server architecture. We distributed a client library to our users; the client sent a request to the BuildRabbit scheduler, which acted as a load balancer and returned which of the workers the client should talk to. The client then sent the request directly to that worker and got the result back from the worker directly.

This worked fine for many years, but we started hitting limitations. If the connection broke between the worker and the client, the client lost all of the progress: it couldn't tell that the build had already made progress up to some point in time, so it had to restart the whole build. It also resulted in a very thick client library: the client had to understand a lot of the internals of how the build system works, and had to interpret whether the build system had some transient issue, or whether the request itself was bad and the client needed to issue a different request that could succeed. So the client library was in charge of a lot of the routing and error-handling logic.
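As a sketch, the old flow looked something like this. This is my illustration, not BuildRabbit code; the class and function names are hypothetical. The point is that the client owns the routing, and all progress dies with a broken connection:

```python
# Hypothetical sketch of the old client-server flow.
class ConnectionLost(Exception):
    pass

def run_build(scheduler, request):
    """The thick client: picks a worker, talks to it directly,
    and restarts from scratch if the connection drops."""
    while True:
        worker = scheduler.pick_worker(request)   # load-balanced worker choice
        try:
            return worker.execute(request)        # direct client -> worker call
        except ConnectionLost:
            # No state survives outside the client: all build progress
            # is lost, so the whole build starts over.
            continue
```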
This became very difficult as the use cases grew and lots of users came to rely on the system, because it meant that moving the infrastructure forward, developing and evolving it into a better, more resilient system, was a lot harder for us.

So we converged on something we call the build service. We kept the same client library, so we kept the old facade, but we switched out all of the internals of how a build is processed by our system. Now the client puts the request onto a persistent queue, and when a worker is available and has capacity to execute the work, it takes the request from the queue.

There are two different types of output generally produced by the build system. One is build artifacts: these are your binaries. The other is build progress information: which tests succeeded, which tests failed. The worker outputs the build artifacts as it builds and compiles them, and puts them into storage, where they're accessible through the build artifact service. It also streams the build progress information. Some of our users only ever cared about the build progress information: they only wanted to know which tests failed, so they could go back to their source code and figure out how to fix it. And some of our clients needed the artifacts. So clients could choose which of the outputs they wanted and request that specifically. All of the green boxes are part of the build system, while the orange box, the client library, stayed the same.

So, putting them side by side, there's the old architecture and the new architecture. As you can see, it wasn't easy to translate the outputs and the workflow from one system to the other: there are a lot more boxes, and the flow of data through the system is now very different. We needed to figure out what it meant to transition from the old architecture to the new architecture with no downtime.

First of all, we realized that we knew what the old architecture looked like: we had been operating and developing it for many, many years, so we understood it very well. We also understood the new architecture, because we built it specifically to solve the problems we had encountered and wanted to solve for our users. What we didn't understand was the intermediate steps we had to take.

As an example of the intermediate steps: say we want to migrate the architecture from using the scheduler to using the persistent queue. The way a request used to go is that the client library sent the request to the scheduler and got a response back, then sent the request to the worker and got back all the outputs of the build request. Now we want to bring up the persistent queue and start sending all of the new requests to it. So once the persistent queue service is up and running and ready to accept the workload, we stop sending requests to the scheduler, and after a while all the new requests are going to the persistent queue. The reason for this ordering is that we want the system to stay available to the users: we want them to continue sending us requests; we just want to change how and where they send them.
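Here's a hedged sketch of the queue-based flow the clients are being moved onto, with an in-process queue standing in for the real persistent queue; all names are hypothetical:

```python
import queue

# Stand-in for the real persistent queue.
build_queue = queue.Queue()

def submit(request):
    """Client side: enqueue and return; progress no longer lives in the client."""
    build_queue.put(request)
    return request.request_id    # used later to fetch progress or artifacts

def worker_loop(storage, progress_stream, run_build_steps):
    """Worker side: dequeue only when there is capacity, then emit both outputs."""
    while True:
        request = build_queue.get()              # blocks until work is available
        for artifact in run_build_steps(request):
            storage.write(artifact)              # served via the build artifact service
        progress_stream.publish(request.request_id, "build complete")
```

Because the request lives in the queue rather than in a client-worker connection, a broken connection no longer throws away the build's progress.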
Now, there were a few in-flight requests that had gone to the scheduler, and after they completed, we stopped receiving responses from the scheduler. After a while, because the client library stopped hearing from the scheduler which worker it should talk to, it no longer needed to talk to the workers directly to send requests. And as soon as a worker comes up, has capacity, and is finished with its previous request, it can dequeue from the persistent queue. After all the in-flight requests where the worker was talking directly to the client library completed, we could stop that communication channel altogether. And for the two different types of output, which isn't shown here, we had to come up with other ways to serve the request results to our users.

I led the launch of the dequeuing aspect of the architecture, and it entailed not only all of the steps I just mentioned, but also working with five different teams in three different geographic zones across two different time zones. So understanding very well which intermediate steps we would need to take in order to migrate to this new architecture was very important to us.

We also needed to focus on ensuring that we could bring up our backends, and that they could serve the new load and withstand the existing load. So first of all, we focused on migrating our backends. We needed to ensure that the new backends, the new services we brought up, could serve the existing workload well and do no worse than the old system, and also that we could scale them up for growth.

This wouldn't have been possible without also understanding how to decouple things. Not only did we need to migrate our backends and figure out the intermediate steps, we also needed to make sure we could launch parts of the architecture before all of the backends were ready. So we needed to figure out how to break up the launch, how to break up the migration, so we could decouple the launches of the services. That ended up being split something along these lines: the build artifact service could be launched independently from the build progress information service, while the queue and the persistence it depends on needed to be launched together. What this allowed us to do, first of all, was to avoid being blocked by any one service not being ready for launch, and still proceed with the launch. It also meant we had to consider very carefully what the system would look like if only one of the backends was up, if all of the backends were up, or if the backends were in partial states of readiness, and what that meant for our system.

What this really meant is that we needed to consider what our system would look like in each state, and write a lot of throwaway code. The way I visualize it is that every single intermediate state we had was like an island, and the throwaway code was like bridges connecting them. We needed to consider all the different intermediate states, and how they could be connected together with throwaway code: code that is unnecessary after the migration is done, but essential for doing the migration without any downtime.
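To illustrate the kind of throwaway bridge code this implies, here's a hypothetical sketch (the readiness flags and services are invented for illustration; this is not the actual implementation):

```python
# Throwaway "bridge" code: route each request based on which new backends
# are ready, so service launches can be decoupled. This code exists only
# for the duration of the migration and is deleted afterwards.
def route_request(request, ready, build_queue, scheduler):
    if not ready.progress_service:
        # Partial readiness: keep reporting progress the old way until the
        # build progress information service is launched.
        request.use_legacy_progress = True
    if ready.persistent_queue:
        build_queue.put(request)                 # new path: a worker dequeues later
    else:
        worker = scheduler.pick_worker(request)  # old path, kept as the rollback
        worker.execute(request)
```

Each combination of readiness flags is one of the "islands", and code like this is the bridge between them.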
Also, as I mentioned earlier, we had a client library that we distributed to our users, and we couldn't ask them to follow our release cycle. We released weekly, but the whole point of having a shared infrastructure is that we don't get to rely on our clients always being up to date. That meant we needed to support clients that were months old, so that our users could focus on their work while we proceeded with changing, evolving, and improving our system. So we needed to consider what the state of the system was months ago, what the state of the system is now, and how to translate between the two. There was a lot of throwaway code involved in that too, but it ensured that old clients that hadn't rebuilt against the new client library yet could continue using the system and get the results they expected.

Now, of course, we couldn't just turn the system off and turn it back on, or just turn on all the new code paths immediately for all our users. We needed to do an incremental rollout, and this wouldn't have been possible without a system like Mesos, or what we use internally at Google, Borg. The incremental rollout looks something like the following: we have two different zones, and instead of launching to all of them, we go one by one: first to one machine in one zone, then to the rest of the machines in that zone, then comparing the zones side by side. This let us control individual machines and compare metrics side by side: do we have worse latency through the new code paths, or are we doing just fine? Is our resource usage a lot higher in the new code path, and can we explain it, or is it a lot lower, which is what we want? It was very important for us to be able to launch to very specific places and control which zones and which machines were running the new code, to have better confidence in our launch.

This also meant we needed to figure out which clients could benefit most. We needed help from our clients, so we asked some of them, those who could dedicate time and energy, to help us verify that the launch worked as intended and solved their problems, and we needed to find clients for whom the benefits of the new system would be highest, so that it was worth it for them. It ended up being split roughly along the same boundaries as the backend launches. For the launch I led, the dequeuing aspect, that is, the persistent queue and the BuildRabbit worker, those clients were our integration testing framework and our continuous integration infrastructure. We needed to work with them very closely to ensure the success of the launch, and to ensure that what we were launching was correct and would work well for the rest of the clients.

This also allowed us to isolate which clients could be hurt. For the higher-risk clients, those working with us directly, we could isolate the impact to just them, ensuring no user-visible impact for all the clients not participating in the initial launches.
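A hedged sketch of that zone-by-zone rollout is below; the stage names, target patterns, and the deploy/healthy/rollback hooks are all hypothetical:

```python
# Incremental rollout: one machine, then one zone, then compare zones
# side by side before going wider.
ROLLOUT_STAGES = [
    ("canary", ["zone-a/machine-1"]),   # one machine in one zone first
    ("zone-a", ["zone-a/*"]),           # then the rest of that zone
    ("zone-b", ["zone-b/*"]),           # then the next zone, compared side by side
]

def roll_out(deploy, healthy, rollback):
    for stage, targets in ROLLOUT_STAGES:
        deploy(targets)          # push the new code path to just these machines
        if not healthy(stage):   # compare latency and resource usage metrics
            rollback(targets)    # blast radius is limited to this stage
            return False
    return True
```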
And in order to roll it out incrementally, we also needed to have maximum visibility into our system. We needed access not only to the best metrics we could have, to compare side by side the latencies, whatever metrics we used for user experience, the resource usage, and the health of the system. We also needed good access to all the different logs, so we could tell whether the system was healthy, up and running, and doing what it needed to do, or detect early that something was going wrong, before we affected lots of users, and roll it back safely. Having really good visibility into how our system was doing was also essential for being able to debug issues afterwards without having to reproduce them in production.

If we think about the launch as a performance, because we have one shot at doing it well and getting it out to production, then it's very important that we practice. And practice was the essential component of our successful launch to production. As I mentioned for the intermediate steps, we needed to make sure we understood all the different rollback steps. As we ship the code to our clients and the new service comes up, if it starts failing to accept the new requests coming in, we want to be sure we understand the rollback steps, so we can roll back quickly and efficiently. This meant that for every single intermediate step we took during the transition, we needed to understand well what the rollback steps would be, in order to get back to a correct state while producing the fewest unwanted side effects. And if all went well, then we could just proceed with the launch.

To practice this, we found that checklists were very, very important and very helpful. There is a book by the surgeon Atul Gawande, called The Checklist Manifesto, in which he talks about the value of having checklists. The checklists are not there to educate us, to teach us something we didn't already know; they are there to help us prevent the simplest mistakes. When a surgeon or a nurse goes into an operation, the checklist is not there for them to learn how to do the job; it's there to prevent the simplest of mistakes. It was helpful for us as well, because as we got closer to the launch, a lot of us were tired and excited, and it was a lot more likely that we would make some simple mistake that would push the launch back. By preventing the simplest mistakes, we could get through the launch without making silly mistakes along the way, and just carry on with it.

The checklists were also very helpful for organizational resilience. On the day of the launch, if somebody had to take off for a family emergency, for a lunch break, or to go on vacation, it didn't matter, because anybody could step in. They didn't have to have trained or practiced with us; the checklists were detailed enough that they allowed us to transfer this knowledge and quickly ramp up anyone who was stepping in to help us out.
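As a toy illustration of what such a checklist pairs together (entirely my sketch, with steps echoing the earlier scheduler-to-queue example, and the wording hypothetical):

```python
# Every forward step is paired with its rollback, so anyone stepping in
# can execute or undo it without prior practice.
LAUNCH_CHECKLIST = [
    ("bring up the persistent queue service",   "turn the queue service down"),
    ("point new requests at the queue",         "point new requests back at the scheduler"),
    ("stop scheduler intake",                   "re-enable scheduler intake"),
    ("retire the direct client-worker channel", "restore the direct channel"),
]

def run_checklist(apply, verify, rollback):
    done = []
    for step, undo in LAUNCH_CHECKLIST:
        apply(step)
        if not verify(step):              # watch metrics and logs before moving on
            rollback(undo)                # undo the failing step first...
            for _, prior_undo in reversed(done):
                rollback(prior_undo)      # ...then unwind earlier steps in reverse
            return False
        done.append((step, undo))
    return True
```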
So, in order to launch with zero downtime and no user-visible effects, there were key considerations we needed to understand, work on, focus on, and be prepared for.

We needed to understand all the different intermediate steps of our launch: how to get from the old architecture to the new architecture, and everything in between that needed to be done. That meant writing lots of throwaway code that wasn't useful after we launched, but gave us the support we needed during the migration. It also meant migrating our backends first, because we needed to prove that our new architecture actually solved the problems, and that our design and implementation of the new services could support the existing workload and be scaled for growth. So a lot of energy went into understanding all the different intermediate steps and refining our plan.

It's important to be able to roll out incrementally, especially if you're going for zero downtime migrations, and that wouldn't have been possible without systems like Mesos, or what we use internally at Google, Borg, because they give us much tighter control over which machines are running the new workload while the rollout happens seamlessly for our users. We were able to compare side by side how our new code paths and our new system were doing against the old system, and see whether users were experiencing any issues with the new system.

And finally, practice was the key component of our success in launching to production with no downtime. It ensured that we were well prepared for anything that could go wrong, and that we knew how to roll back to the old state of the system and try again later, while in the meantime continuing to provide a good, functioning service to our users. And it meant we used a lot of checklists to coordinate across many different teams throughout different time zones, which enabled us to launch this new architecture with zero downtime.

That's it for today. Thank you.