So yeah, I'm Melanie. I'm here to talk about performing infrastructure migrations at scale, particularly at Airbnb's kind of scale, so a pretty large company. And I'm really happy to be here at KubeCon China. Oh, is my clicker working? Yes, okay. So I'm Melanie. I work as a software engineer on the service orchestration team at Airbnb. Our goal is to empower our engineers to create, maintain and operate their own services, and we use Kubernetes as the orchestration layer for our services. And I want to demonstrate best practices for migrations. Yes, sorry, yeah. There's not that many seats in here. But I wanted to demonstrate best practices for migrations based on my firsthand experience of migrations at Airbnb. I think one of our best migration case studies is Kubernetes itself, but I'll also talk about a few related migrations that we worked on and demonstrate the best practices from there. So today about 70% of our services are in Kubernetes, mostly business-facing services, versus the legacy system. And we have about 300 critical services in Kubernetes running in production. What exactly is a migration? I've found that a lot of people have different ideas. So you can think about on-premise versus cloud. Yeah. What's that? Is this better? Can you hear me now? Yeah. Yeah, okay, cool. So yeah: non-cloud to cloud, VMs to containers, configuration management to orchestration, API framework changes, like if you want to introduce circuit breaking or request throttling. Replacing entire systems: we've replaced CI systems, build systems, deploy systems. Having a new service proxy or service mesh. A new language or framework version, like a JVM upgrade. Security patches, and more. So that's an example list of migrations. It goes on and on. And for me, I think it's helpful to think about migrations along different dimensions.
So when you think about what a migration is, there are low effort and high effort migrations. A security patch is technically fairly low effort because you're just bumping your Ubuntu version or something like that. Whereas a very high effort migration would be going from non-cloud to cloud, because you have to somehow rewrite everything to work in a cloud environment, or containerization, where now everything has to run in containers. So the difference here is: are you just bumping a number? I mean, it might be more effort than bumping a number, but that's lower effort than something that involves system design, a re-architecture, and really rethinking how you build things. Another dimension I like to think of for migrations is whether it's urgent or discretionary. So again, a security patch is fairly urgent because you want to be secure as soon as the vulnerability is found. A more discretionary migration tends to be something that can take months or years to unfold, and you generally have a lot of time to do it. And for a very urgent migration, the way that you resource and plan it, and the risk you're willing to take on, is very different than for a discretionary migration. And then the other dimension I like to think about is what you need at any scale versus what you need at a very high scale. So there are migrations you need regardless of scale, like upgrading your language or framework version, and then migrations that you don't really need to do until you hit a certain scale. And once you hit that scale, you really need to do that migration. Some examples of things you need to do at high scale are updating your storage layer, your orchestration, or your service mesh. These are things where, when you have a lot of load and a lot of traffic, you start to realize that you need these migrations.
So these are the three dimensions I think of, and you'll realize that all of these dimensions are a little bit related. For example, the higher the scale you're at, the more common it is that a migration becomes urgent, and also that it becomes higher effort. So they're all kind of intertwined. So that's how I think of migrations, what they are, and their different dimensions. But why are they important? Because they're a bit of an effort to do. When you're a very new company, and there are probably quite a few new tech companies here in Shanghai, you don't really have to worry that much about migrations. But as your company matures, migrations become a fact of life. Airbnb is 10 years old now, and a decade of technology is really out of date. So you're sort of replacing all of the pieces as you go. And you find that as your company grows, you accumulate a lot of tech debt, and it becomes important to reduce that tech debt to maintain your velocity and your competitive edge. Migrations are a way to reduce this tech debt. The biggest tech debt for large companies is low developer velocity, or low developer productivity: developers are less productive and less efficient because of the amount of tech debt. But there are other kinds of tech debt that are a little more subtle. When you're scaling exponentially, like a hypergrowth company, you accumulate tech debt much faster, and you need to keep up with it. So here are a few other examples of tech debt that you will run into: scaling limits, networking issues, systems that have reached end of life, et cetera. So yeah, that's why migrations are important. They're basically the sole lever to systematically create technical leverage at scale.
So now that we know what migrations are and why they're important, I want to spend most of the time talking about how you do migrations well and what strategies you need to succeed at them, and also give some examples of things that don't go that well. The first thing is that it's helpful to break migrations down into categories, or what I call migration types. And so I have this nice graph. Basically there are three categories: a component migration, which you can think of as an upgrade or a patch; a system migration, which is moving from one system to another; and an infrastructure migration, which usually involves rewriting or replacing an entire infrastructure or architecture. So yeah, components are upgrades, patches, refactors. A system migration is a move from one system to another. And infrastructure is really a huge change. I think it's very helpful to start here because when people talk about migrations, they mean all of these, but the strategy is very different depending on what kind of migration you're doing. So here's an example for us. We do all these kinds of migrations. Component migrations are upgrading operating systems and languages, and deprecating things. As I said, replacing a CI/CD system or moving to a new service mesh for networking is a system level migration. And infrastructure, for a lot of companies, is moving to the cloud, or containerization, or K8s, or serverless. Any of those is really a big infrastructure change. So the strategy here is to know which type you're working with. That's really the first step. And also know that as you go up in this hierarchy, you have exponentially increasing complexity for each type. It's way more complex to do an infrastructure rewrite than a system move, and a system move is more complex than a component update.
And you should also know that as you go up, you're dealing with a lot more complexity and a lot more overhead, and therefore it's a lot riskier. So it's more dangerous to do a bigger change. For me, one of the most important migration strategies, other than identifying the type of the migration, is how you're going to sequence it. I'm going to spend a lot of time talking about this, because I think what is really complex about migrations at scale is that they all interact with each other. They're not in isolation; they all affect each other. So how do you think about sequencing migrations? A complex migration tends to have a lot of dependent migrations that you may not have identified yet, and this requires sequencing, basically planning the migrations one by one. If you don't do migrations very often, if you run them infrequently, you end up with a cascading migration problem. And similarly, if you're not good about finishing migrations, if your migrations are inefficient, then you have a lot of simultaneous migrations all running at the same time. I like the term cascading migrations: if you decide, oh, I don't want to do a migration today, I don't want to do a migration a year from now, and then you finally start doing your migration, you'll realize that there are a lot of things you need to do to reach what I call the ideal migration state. Like, I want to run in Kubernetes: what are all the things I need to do to get there? If you don't plan it and think about it, you'll have a lot of dependent things you need to do first. So one way to think about it is that if you're very good and careful about planning, you don't have to do a rewrite. But most people, if they wait too long, kind of have to do that. And so there's a lot of complexity involved.
One of the reasons I like to call it cascading migrations is that it reminds me of cascading failures, which is a bad state to be in. So it's kind of a funny way to talk about it. You don't really want this to happen, and this is also kind of what happened at Airbnb. We were running a very old configuration management system. At this point, it's probably five to nine years old. So in order to get to Kubernetes, we really had to make a lot of changes very quickly. We were already in the cloud, so it wasn't as big of a jump. But you can imagine that if you want to go to Kubernetes and you're not containerized yet, or you're not in the cloud yet, these are dependent migrations you have to do first. And if you try to do all of these at once, it's just a lot riskier. So you want to consider doing them incrementally first. Okay, so simultaneous migrations. Basically, if you're inefficient at migrations, you'll likely be performing a lot of them at the same time. Inefficient migrations can cause your overall velocity to slow down because you're doing all of them at once. And they don't necessarily depend on each other. So unlike cascading migrations, where you have to do this one and this one to get to the ideal migration state, these migrations don't necessarily depend on each other, but they still add to the overall complexity of the system and introduce additional risk. So here's one example, and this is a real Airbnb example, except we do way more migrations at the same time. But here are three we're doing at the same time, right? I'm working on our Kubernetes migration. At the same time, our CI team is trying to replace our CI system because the old one doesn't scale anymore. And our CD team is trying to introduce deployment pipelines so that developers can have a canary environment and test their changes better.
So the first thing you want to ask is: are these migrations really independent? What CI system you use and what CD system you use actually kind of depends on Kubernetes. For example, if you want to use Spinnaker or something like that, then that's a CD system with Kubernetes-specific support. And for CI, you probably want something that's containerized too. So these migrations are actually somewhat related. And is each migration making assumptions about your system? If you start the K8s migration at time A and then start your deployment pipelines migration slightly after, are you assuming that you're still in that previous state? Will it work once the other migration is done? If you have really slow and really fast migrations all happening at the same time, you may realize that you didn't account for the finished state of another migration. And does your migration actually support a mixed state from another migration? This is a really common mistake that we've made: how can we make sure that your migration works with all the other states the system could be in? This migration is finished, this one's not. So it's very complex having all of these happen at the same time. Those are the things that can go wrong, so what's the strategy here? I call it sequencing migrations: let's plan and think about how we're going to do this. You want to avoid the infrastructure rewrite, so do a lot of minor migrations early on. We want to make migrations themselves frequent and very efficient, so that we can do them really quickly, kind of like a production line. And then, when you're doing the migration, try to keep it from being too open-ended. Let's talk very specifically about what we're going to build, and not something very large like, you know, let's move to the cloud. Can we break it down into small sequential steps?
And then finally, sequenced migrations are a lot safer. If you have all these migrations running at the same time, there's a lot more risk and a lot more edge cases. So yeah, the strategy gives you lower risk migrations, but it requires more planning and more time to get to the end state. One other thing you can do is parallelize migrations that don't have dependencies on each other, kind of like a topological sort. One example from our Kubernetes migration is that we had several things we needed to do first. One is that we were running a really old version of Java, and this version of Java was not container aware. So if you tried to run it in containers or Kubernetes, we had problems where the JVM was allocating more resources than it should, because it was sizing itself based on the host and not on the container running on the host. So all of these containers were trying to use more memory and more CPU than the machine had. The other thing we had to do was upgrade our service discovery layer. We were using an old version of service discovery that works really well when your hardware doesn't change that often. So you have Amazon EC2 instances that stick around and don't move that often, and the service discovery layer can handle that. But when you're using Kubernetes and you have pods that are constantly being rescheduled, the service discovery layer kind of breaks down. So we have these two upgrades. They don't have any dependencies on each other, but they both need to happen before you use containers. So you could start those first, then do the containerization effort, and then do the Kubernetes migration. And one other thing I want to talk about here is prioritizing migrations. Let's say you have a bunch of migrations that you want to complete. What can you do to decide which ones to do first?
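As a rough sketch of the sequencing idea above (not Airbnb's actual tooling), the four migrations from the example can be written down as a dependency graph and grouped into "waves", where everything in one wave has no unfinished dependencies and can run in parallel. The migration names and the `plan_waves` helper are made up for illustration; `graphlib` is in the Python standard library from 3.9.

```python
from graphlib import TopologicalSorter

# Each migration maps to the set of migrations that must finish first.
# Names and edges are an illustrative reading of the example in the talk.
DEPENDS_ON = {
    "jvm_upgrade": set(),
    "service_discovery_upgrade": set(),
    "containerization": {"jvm_upgrade", "service_discovery_upgrade"},
    "kubernetes": {"containerization"},
}


def plan_waves(depends_on):
    """Group migrations into waves; migrations in the same wave are independent."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the plan is circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unlock the next one
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(plan_waves(DEPENDS_ON), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

With this graph, the two upgrades land in the first wave together, and containerization and Kubernetes each get their own later wave.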
One way I like to think about prioritizing is that you should prioritize migrations that make the other migrations easier. One example is that we migrated to infrastructure as code first. We had a lot of infrastructure settings that were not in code. Think of just a basic Amazon web console. You had no way of really storing that information in a way that developers could edit it. Infrastructure as code could be Kubernetes, could be Terraform, Puppet, Chef, anything like this. It's better than not storing it as code. And once you do that, your migrations can be code refactors. Once you have code refactors, you can build tooling to automate them. So that's really useful. If you decide to migrate a bunch of infrastructure before doing this one, a lot of those are going to be very manual migrations, and you can't automate them as easily. So prioritizing is one of the biggest things you can do. Cool, so those are some basic migration strategies. One other thing I want to talk about is what makes migrating at scale so especially difficult. Even if I know to do all these things, why do we still have trouble? So I want to come back to this graph. Let's say you're thinking about replacing your service mesh, and you forecast that you probably have a few months or a year to replace it. So you want to do it right. You want to plan it out and slowly start migrating your edges over. And then suddenly you realize that you have a lot of unacceptable failures that start happening, and now this migration is very urgent, whereas before it was not urgent. And this is why migration at scale is tough: because you have increased scale and you're experiencing peak traffic and peak load, your systems may suddenly hit their limits and not scale anymore. And that causes the migration to become urgent.
It wasn't urgent before, but now that you're at this scale, it is. I like to call that forced urgency. If you're able to properly forecast and plan for this, and finish the migration before it becomes urgent, then you have all this time. But if you don't foresee it, then once you're there, you suddenly have to do the migration, and now it is urgent. This is actually really common when you're working for a company that's at a scale that not that many other companies are at, where you have to do a bit more due diligence on planning for these events. And then the other thing that makes it difficult when you're at scale is just the sheer effort involved. For us, the Kubernetes migration has been at least a one to two year effort, and we're not done yet, right? We're at 70%. So why does it take so long to migrate? Part of that is just the surface area. Imagine you only have one service, or one database, or whatever it is that you need to migrate. You spend some amount of time making it possible to do the migration, then you flip the switch, and you've migrated. But for us it's: what if you have 10 services, or 100 services, or 1,000 services? Then you really can't be manually switching these things over, because it's just going to take too long. There's just more effort in migrating. And the other thing is that when you have that many cases, you're more likely to have weird services or special services, and you start getting into edge cases. Oh, sorry, I forgot to click. So yeah, basically you have to migrate more services, and there's increased effort because there are more complex services involved.
So for example, here's one other migration we've been working on: we want to switch our proxy from HAProxy to Envoy. This applies to pretty much any proxy change, or anything to do with service discovery. You have way more services, and all of those services talk to each other, so you have this very complex graph of services with all these edges between them. And you're replacing your proxy because you're having problems with your previous proxy. But this means that as long as you haven't migrated to the new proxy, whatever problems you have keep getting exponentially worse as more services and more edges are added. So it's compounding with the number of edges. For me, the biggest thing here is to try to forecast expected load and do load testing and chaos testing to catch these things before you're there. And the other thing is to try to stress test your system for actual load. So try to get ahead of this problem with long-term planning and forecasting, and realize that if you've failed to do that, you're in a different state. Either you're doing long-term planning and doing it well, or, as soon as you realize you only have a short time to do it, you're in firefighting mode and you need to move very quickly. Ideally you're doing the forecasting and the stress testing, but once you realize you're having problems, you might need a very aggressive timeline. One other strategy I have here is actually even more important, which is to make time work for you. Basically, try to deprecate the old thing first. I talked about how you have exponentially more edges, right? How can we make it so you don't have this exponential problem? Make the new approach the default as soon as you can.
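One way to picture "make the new approach the default" is a scaffolding script that gives every newly created service the new layout from day one, so only the existing services are left to migrate. The sketch below is an assumption-heavy illustration, not Airbnb's generator: the `_infra/service.json` layout, the `config_format` field, and both function names are invented for this example.

```python
import json
import tempfile
from pathlib import Path


def generate_service(root: Path, name: str) -> Path:
    """Scaffold a new service that already uses the new configuration layout."""
    service_dir = root / name
    infra_dir = service_dir / "_infra"  # hypothetical per-service config directory
    infra_dir.mkdir(parents=True, exist_ok=False)  # refuse to overwrite a service
    config = {"service": name, "config_format": "v2"}  # illustrative fields only
    (infra_dir / "service.json").write_text(json.dumps(config, indent=2) + "\n")
    return service_dir


def needs_migration(service_dir: Path) -> bool:
    """A service counts as legacy if it has no in-repo configuration yet."""
    return not (service_dir / "_infra" / "service.json").is_file()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "legacy-service").mkdir()  # created before the generator existed
        generate_service(root, "new-service")
        backlog = sorted(p.name for p in root.iterdir() if needs_migration(p))
        print("still to migrate:", backlog)
```

The point of the design is that the migration backlog stops growing: anything created through the generator never shows up in `needs_migration`.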
Then, once the new approach is the default, instead of racing to migrate things as they pop up, all of the new things are already migrated, because they're using the new thing. And now you just need to migrate all of the legacy things. One example here is our infrastructure as code. I have a bonk service, and it has an _infra directory, and this directory has all of our service configuration. We basically wanted to move service configuration into each service, so this is in my bonk service. And since we have exponentially more services being created, we created a generator that generates new services using the new approach. So when new services were created, they were already using this thing. And now we can kind of breathe a sigh of relief and just migrate all of the old things. If you don't do this, you're going to be constantly trying to catch up with the new services, migrating them as they're being created. So this is a really, really good strategy. One other thing I want to talk about is migration overhead. I talked about how a migration reduces tech debt and how it creates technical leverage, but while you're doing the migration itself, it actually introduces complexity to your system, because you're dealing with a mixed state. So migrating is an explicit trade-off: you take on overhead now to reduce worse overhead later. It creates overhead for those who are running the migration effort, for those who are actually migrating, and for those who are maintaining the old thing and the new thing. So, talking about our proxy change, you have exponentially more services and edges, and you have a bunch of different ways to use them. I don't know if you've ever looked at all of Envoy's documentation, but it's a lot.
So imagine you're trying to use all of these features. For us, we're trying to patch how we use HAProxy while building out the new thing. So there's a lot of overhead in maintaining the mixed state. And this overhead kind of looks like this, where you want to reduce tech debt via migration. What developers want is to spend some of their time reducing tech debt and the rest of the time actually working on what they work on. And what developers actually get, if you're not careful, is making progress on their thing and then us saying, oh, can you migrate all these things? At Airbnb, we actually counted how many migrations we were asking developers to do, and it was over 40. So developers were freaking out, because they're trying to make progress on their products and we're asking them to migrate all these things. Even if each migration is fairly simple, there's no way they can do all of them. So we have this problem where, if you have worsening tech debt and you start a new migration and then you don't really finish it, you actually just get worse tech debt again, because you have a super complex system. This is something you really want to be careful of. I also think of it like bingo. Each of these is a different state in your system. So maybe this service is using the old CI, but Kubernetes, and the old CD or the new CD, or whatever; it's using all these different things. And so you have a bunch of problems. Future migrations are now harder because you're trying to support all these different mixed states. Tech debt is worse because developers aren't really sure what they're using and how to fix certain problems. And mainly, there are a bunch of edge cases that you just don't account for.
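To make the "bingo card" of mixed states concrete, here is a small sketch that enumerates every combination of old and new across the three in-flight migrations mentioned earlier and flags the combinations a given migration does not support. The state names and the `new_cd_supports` rule are hypothetical; the rule stands in for an assumption a CD rollout might quietly bake in.

```python
from itertools import product

# Each in-flight migration and its (old, new) states; names are illustrative.
MIGRATIONS = {
    "orchestration": ("legacy", "kubernetes"),
    "ci": ("old_ci", "new_ci"),
    "cd": ("old_cd", "new_cd"),
}


def all_states(migrations):
    """Every mixed state a single service could be in."""
    names = list(migrations)
    combos = product(*(migrations[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def unsupported_states(migrations, supports):
    """States that some service could reach but the migration does not handle."""
    return [state for state in all_states(migrations) if not supports(state)]


def new_cd_supports(state):
    # Hypothetical assumption: the new deploy pipelines only work on Kubernetes.
    return state["cd"] == "old_cd" or state["orchestration"] == "kubernetes"


if __name__ == "__main__":
    print(len(all_states(MIGRATIONS)), "possible states")
    for state in unsupported_states(MIGRATIONS, new_cd_supports):
        print("unsupported:", state)
```

Three two-state migrations already give eight combinations, and every additional unfinished migration doubles that, which is one way to see why unfinished migrations make edge cases so hard to anticipate.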
And I call it bingo when you hit an edge case that's so subtle, and so few services are running into it, but all of the mixed state is exactly why they're hitting it. You don't want your infrastructure to look like this. For me, the biggest thing here is that unfinished migrations actually make tech debt worse, so you really need to finish each migration. And I have a few strategies to end on, which are basically about how we can make sure migrations get finished. Developing abstractions over the infrastructure is very important. You want to make the current migration easier, but you also want to avoid leaky abstractions so that future migrations are easier too. One example I have here is the Kubernetes files. We have all these Kubernetes configuration files, and we built an internal abstraction over them, because we know that five years from now we won't be using Kubernetes, we'll be using the next technology. So how can we make it a little bit better now? These are all concepts on the left that Airbnb engineers are mostly familiar with. We're trying to make them not think too much about whether they're using Kubernetes or not. So this is our abstraction. But one thing I realized is that this abstraction still leaks several things, like the containers and some of the Kubernetes settings; we didn't fully abstract them away. So what's a better abstraction? This is what we're working on now: what if we have a manifest that just says, this is my service, it uses a lot of resources or a few resources, it's a web service or a cron job or some other workload, and these are the services it calls.
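A minimal sketch of what such a manifest could look like, assuming a platform layer that renders it into Kubernetes objects. Everything here is hypothetical: the `ServiceManifest` fields, the size presets, the `registry.example.com` image naming, and the `example.com/calls` annotation are invented for illustration and are not Airbnb's internal format. The rendered dictionaries follow the shape of the standard `apps/v1` Deployment and `batch/v1` CronJob objects.

```python
from dataclasses import dataclass

# Hypothetical size presets; service owners never write CPU or memory values.
SIZES = {
    "small": {"cpu": "250m", "memory": "512Mi"},
    "large": {"cpu": "2", "memory": "4Gi"},
}


@dataclass(frozen=True)
class ServiceManifest:
    """What a service owner declares; nothing in here mentions Kubernetes."""
    name: str
    kind: str = "web"   # "web" or "cron"
    size: str = "small"
    calls: tuple = ()   # names of the services this one talks to
    schedule: str = ""  # cron expression, only used when kind == "cron"


def _pod_spec(manifest):
    return {
        "containers": [{
            "name": manifest.name,
            # Hypothetical convention: the platform derives the image from the name.
            "image": f"registry.example.com/{manifest.name}:latest",
            "resources": {"requests": dict(SIZES[manifest.size])},
        }]
    }


def render_kubernetes(manifest):
    """One possible backend; a future platform would add another renderer."""
    labels = {"app": manifest.name}
    metadata = {
        "name": manifest.name,
        "labels": labels,
        "annotations": {"example.com/calls": ",".join(manifest.calls)},
    }
    if manifest.kind == "web":
        return {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "metadata": metadata,
            "spec": {
                "selector": {"matchLabels": labels},
                "template": {"metadata": {"labels": labels},
                             "spec": _pod_spec(manifest)},
            },
        }
    if manifest.kind == "cron":
        pod = _pod_spec(manifest)
        pod["restartPolicy"] = "OnFailure"
        return {
            "apiVersion": "batch/v1",
            "kind": "CronJob",
            "metadata": metadata,
            "spec": {
                "schedule": manifest.schedule,
                "jobTemplate": {"spec": {"template": {"spec": pod}}},
            },
        }
    raise ValueError(f"unknown service kind: {manifest.kind!r}")


if __name__ == "__main__":
    bonk = ServiceManifest(name="bonk", size="large", calls=("payments",))
    print(render_kubernetes(bonk)["kind"])
```

Because the manifest carries only intent (what the service is, roughly how big, what it calls), moving to a different platform would mean writing a new renderer rather than touching every service.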
And can we limit the manifest down to just that, so that when we migrate to the next thing there's actually no effort at all on the service developer side? We migrate the whole thing, and service developers don't know that they're using Kubernetes, and they don't know that they're using the next thing. So we're talking about building a whole platform on top of this, and this is something I want to build internally. We also want to do this for our service discovery layer, so that as we migrate to Envoy, and as we migrate to the thing after Envoy, it's as easy as possible. This is a strategy for Airbnb, but I think it's actually where the whole industry is going. Today, when you're configuring Kubernetes, you probably get frustrated by the sheer amount of configuration you're doing. I think in the future this will become a lot easier, but we're still working on building that platform. So that's about how we can make migrations better as a Kubernetes community, and it's something I've been thinking about a lot. And here are some other strategies. When you're thinking about migrating, can you standardize on something that's 90% correct for most cases, and can we automatically migrate that case? Then, can we migrate under an abstraction layer so developers aren't aware that we're migrating? And finally, can we migrate programmatically? Can we do code refactors to migrate people? So we have something called a refactorator, and it performs refactors. Because we have infrastructure as code now, we've unlocked automated refactors. And basically this process is a bunch of scripts that cover the whole refactor itself.
So if you think of how a refactor should look, basically we have a Kubernetes cron job that runs, and it can look at a code base and know what state it's in and what state we want it to get into, and run the refactor. It has this logic where it checks out the repo, finds the project in the repo, runs the refactor job that gets it to state B, tags the owners of the service, and creates the PR. No developer does this by hand. A developer writes the refactor job, so as input you need to be able to say how to go from state A to state B, but the rest is automatic. The refactor job will run and create the PR, and it can also comment on the PR, saying: hey, owners, can you please look at this? Can you please edit or merge the PR? And the refactor job can also just merge the PR itself. So we actually have some refactors with no developer involvement at all. We just run it, create the PR, merge the PR. You can imagine that a lot of our services use Dockerfiles that inherit from some base image. If we realize the base image needs to pick up a security fix, we can run our refactor and immediately give everyone the new security fix without them being involved. So that's really powerful. We're not asking the developers to do it, we're just doing it automatically. So what can we do automatically? What Kubernetes version you're using, that can be automated. Base image upgrades, like for security patches, we can do programmatically. Changing the CI/CD system is a little bit tricky, but we've done that: we did about 90% of it automatically for our CI migration two or three weeks ago. We ran a refactor that created the new CI job, and if the new CI job passed, we would merge it, and if it failed, that meant developers had to look into why the new CI was failing. And some other things as well. And I wanted to end on the migration program.
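The refactor flow described above (detect state A, apply the change to reach state B, open a PR, tag owners, optionally merge) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the real tool: the repository is modeled as an in-memory dictionary of file contents instead of a real checkout, the `PullRequest` record stands in for a code-hosting API, and the base image names are made up.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Repo = Dict[str, str]  # path -> file contents; stand-in for a checked-out repo


@dataclass
class PullRequest:
    title: str
    changes: Repo         # only the files the refactor touched
    reviewers: List[str]  # service owners tagged on the PR
    merged: bool = False


@dataclass
class Refactor:
    title: str
    needed: Callable[[Repo], bool]  # is the repo still in state A?
    apply: Callable[[Repo], Repo]   # returns the repo in state B
    auto_merge: bool = False        # merge without developer involvement


def run_refactor(repo: Repo, owners: List[str], refactor: Refactor) -> Optional[PullRequest]:
    """Run one refactor against one repo; returns None if nothing needs to change."""
    if not refactor.needed(repo):
        return None
    updated = refactor.apply(dict(repo))  # work on a copy, like a feature branch
    # File deletions are ignored here to keep the sketch short.
    changes = {path: text for path, text in updated.items() if repo.get(path) != text}
    pr = PullRequest(refactor.title, changes, reviewers=list(owners))
    if refactor.auto_merge:
        repo.update(changes)
        pr.merged = True
    return pr


# Example refactor: pick up a patched base image (image names are hypothetical).
OLD_BASE = "FROM base-image:1.0"
NEW_BASE = "FROM base-image:1.1"


def bump_base_image(repo: Repo) -> Repo:
    repo["Dockerfile"] = repo["Dockerfile"].replace(OLD_BASE, NEW_BASE)
    return repo


base_image_bump = Refactor(
    title="Pick up base image security fix",
    needed=lambda repo: OLD_BASE in repo.get("Dockerfile", ""),
    apply=bump_base_image,
    auto_merge=True,
)

if __name__ == "__main__":
    bonk_repo = {"Dockerfile": "FROM base-image:1.0\nCOPY . /app\n"}
    pr = run_refactor(bonk_repo, ["@bonk-owners"], base_image_bump)
    print(pr.title, "| merged:", pr.merged)
```

Checking `needed` first makes the job safe to run on a schedule: a repo that is already in state B produces no PR, so re-running the same refactor across every service is harmless.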
So, the migration program. When you think about your migration strategy, what is it? Do you make one person do all of it? Maybe if you're a very small company, you can make one person do all of it. What's nice about that is you get a very tight feedback loop, because they're doing the whole thing. What doesn't work is that at a huge company, one person can't do the whole thing. Another thing we do is make devs do all of it. So we ask developers: hey, can you migrate this? The problem with this approach is that everyone's always asking developers to migrate. Are they actually going to do the migration? How do we make sure that we finish the migrations? Because I just talked about how bad it is if migrations are left unfinished. So if we want to actually finish the migrations, we need a migration team that owns the migration end to end and figures out how to make it as efficient and frequent as possible. In our case, we have a Kubernetes migration team that's running this effort, and I'm part of that team. Oh sorry, there's one other thing I want to talk about, which is the life cycle. How do you know whether you actually want to do a migration, and how do you do it? You start with a design document and a prototype, and you stress test. But one thing you should be careful about here is that you need to make sure the technology you're using really works for your hardest cases: high load, high traffic. Don't start with the easiest case and assume that you've validated your migration. Make sure it works with high throughput, low latency, all the requirements that you have. So really make sure that it works. Don't just build a simple prototype and assume it'll work when you get to the really complex cases. So we did that. And then the enable phase is basically building that abstraction layer, writing the documentation and codelabs, and then programmatically migrating everything.
So once you really think the migration's ready to go, you have all of this tooling. This is the phase where you really want to spend your time building good documentation, good tooling, and a good abstraction layer, because this is what's going to make future migrations better. And then the finish phase is the phase we're still in, where basically you want to iterate until you've fully migrated the system. We've programmatically migrated 70% of our services, but we still have 30% of services that are very complex, so we're working with the service owners to migrate those last ones. We have 10 years of services, so it's a lot. So yeah, that's the finish phase, and I wanted to leave time for the 10 takeaways. You want to identify the migration type. You want to run frequent, efficient, and tightly scoped migrations. You want to do migration sequencing, meaning prioritizing and planning them. You want to stress test and forecast migrations before they become urgent. You want to make the new approach the default so that time is working for you, not against you. You want to fully finish migrations to reduce tech debt. You want to develop abstractions over the infrastructure and run migrations as code refactors. And then, when you're actually thinking about the migration, you need to run a migration program with a lifecycle, and you need to iterate on the migration until it's fully vetted, enabled, and finished. And I'm out of time. Okay, thank you.