Hello everyone and welcome to the CNCF End User Lounge, where we explore how cloud native technologies are adopted by end user organizations across different industries and sectors. Just as a reminder, the CNCF end user community is formed of more than 150 vendor-neutral organizations that use open source tooling to deliver their products. My name is Katie Gamanji, and currently I am the Ecosystem Advocate at CNCF. Today with me I have Steven Chan and Sunil Shah from Airbnb. Thanks for having us. Thank you for being here. In these live streams we bring end user members to showcase how their organizations navigate the cloud native ecosystem to build and distribute their services and products. Join us every Thursday at 9am PT. Just as a reminder, this is an official live stream of CNCF and as such is subject to the CNCF code of conduct, so please be respectful to all of the fellow participants and presenters. If you have any questions for us, we will be monitoring them throughout the stream, so make sure to ask your questions in the live stream chat. As mentioned, today we have Steven and Sunil from Airbnb, and we're going to discuss how Airbnb manages a dense service-oriented architecture of thousands of services across dozens of clusters. Now, before we jump into some of the questions: Steven, Sunil, would you like to introduce yourselves, please? Sure. I've been working at Airbnb for the past three and a half years, and I've had the opportunity to work on two different teams working with cloud native technologies. The first one is our compute infrastructure team, which manages the operations, scalability, and performance of our clusters, as well as the tooling on top of them, like how we generate manifests and how we integrate with existing infrastructure. The second team, which I'm currently on, is our service mesh team, which is building out the next generation of how services are observed, how they are secured, and how they discover each other.
Yeah, my name is Sunil. I manage the compute infrastructure team at Airbnb. I've been here around 18 months. Prior to that I did a very similar thing, but we used Mesos; before that I worked at a company called Mesosphere, which developed the open source Apache Mesos project. So Kubernetes is kind of new to me since I came to Airbnb, but I'm very familiar with the idea of container orchestration. And yeah, the team is really focused on making Kubernetes the de facto compute platform at Airbnb. Nice. I'm actually quite excited to hear more about the work your teams are doing at Airbnb. The first question I'd like to start with is: could you tell us more about your platform or infrastructure setup, but more importantly, why cloud native tools are a cornerstone in creating and shipping your services? I can talk a little bit about that. In order to get a good sense of our current setup, I can talk about our journey here from the very beginning of Airbnb. Back in 2008, Airbnb started out as a single monolithic Ruby on Rails app running in a single AWS account. That worked very well for the early days and for launching lots of features. Then, as the team grew and the company grew in terms of users, we naturally had to start splitting things up. We had a lot of tooling that started out dedicated to just this monolithic app: we had a separate deploy app, and we configured the hosts which ran the application with Chef. This didn't scale so well as we split up into a service-oriented architecture. Users had to go into multiple different repos to change their configuration, they had to deploy each repo in a different way, and that was error prone. They also had to manage their own hosts, and not all teams were comfortable doing that.
So we wanted to reach for a more centralized solution, something that allowed users to deploy their code and configuration in the same repository, in the same way. We didn't want them to create cloud resources by hand in the console anymore. And so we started looking for the tools that could help us achieve these goals, and one of the early tools that we reached for was Kubernetes. This was around 2016, and there were many iterations in between, but I'll start with Kubernetes. What Kubernetes does is let us integrate really well with our existing infrastructure. Previously, like I mentioned, users had to go into one repo to configure their hosts, one repo to configure their alerts and dashboards, and so on. So we built an abstraction on top of Kubernetes called OneTouch. It allows users to have a folder called _infra in their application repo, and in _infra there are a few files, one of which is called kube-gen. It's just a YAML file which allows users to configure things like the services that they'll need to discover, their CPU and memory requests, and so on, and that gets generated into Kubernetes manifests. There are also other files in the _infra folder, like alerts, which get transformed into custom resources. That integrates well because when we deploy all our Kubernetes manifests and the custom resources, we have a custom controller which listens and makes sure that our alerts provider is synced up with the definition that users have. There's a lot more that we could dive deeper into, but that's a whirlwind overview of some of the pieces. That's really great. You've actually mentioned that the previous state of the infrastructure didn't really allow for scale, or even a manageable way to deploy your services.
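To make the kube-gen idea more concrete, here is a rough, hypothetical sketch of a generator that expands a small app-centric config into a full Kubernetes manifest. The field names (name, context, image, cpu, memory, replicas) and the output shape are illustrative assumptions, not Airbnb's actual schema.

```python
# Hypothetical sketch of a kube-gen-style generator: a small, app-centric
# config is expanded into full Kubernetes manifests. Field names here are
# illustrative, not Airbnb's actual schema.
def generate_manifests(app: str, env: dict) -> list[dict]:
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": app, "namespace": f"{app}-{env['name']}"},
        "spec": {
            "replicas": env.get("replicas", 1),
            "template": {
                "spec": {
                    "containers": [{
                        "name": app,
                        "image": env["image"],
                        "resources": {"requests": {
                            "cpu": env["cpu"],
                            "memory": env["memory"],
                        }},
                    }]
                }
            },
        },
    }
    return [deployment]

manifests = generate_manifests("listing-service", {
    "name": "production",
    "context": "prod-cluster-a",  # the cluster this environment deploys to
    "image": "listing-service:1.2.3",
    "cpu": "500m",
    "memory": "1Gi",
    "replicas": 3,
})
print(manifests[0]["metadata"]["namespace"])  # listing-service-production
```

The appeal of this pattern is that service owners only ever touch the small config, while the platform team controls everything else that ends up in the generated manifests.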
That has definitely been, from what I've seen in the past, quite a core motivation for end user companies to move to cloud native. Now, you've mentioned Kubernetes, which you use in your platform, but you mentioned when you introduced yourself that you are involved with the service mesh. Could you maybe touch upon some of the other technologies you're using in addition to Kubernetes, maybe what you're using for logging or for authentication, if you're allowed to say? So, some of the other core parts that you use in your platform. Yeah, so I mentioned Kubernetes a lot; for deploys we're using Spinnaker, for service mesh we're building on top of Istio, and for our logging stack we're making a lot of use of the ELK stack. For authentication, I'll talk mostly about our internal service-to-service authentication: in our service mesh, Istio comes with SPIFFE, which is an identity framework that allows all of our services to communicate over mutual TLS. This is actually great. I'm really excited to hear that SPIFFE is actually used in an organization. The reason I'm saying that is that we had a secrets management radar in the previous quarter, and we didn't have SPIFFE or SPIRE on the radar, mainly because they're still emerging technologies. So it's really great to hear that an organization like Airbnb already integrates it to have this secure communication between services. Now, you were talking about OneTouch, and the first time I heard about OneTouch and this abstraction you have on top of Kubernetes was during KubeCon North America in 2018. That was in Seattle; it feels like a very long time ago now. So, you've mentioned that this is an abstraction that helps your developers deploy services.
But I would like to ask how it actually impacts the maintenance of your clusters and services. For example, is it easier for you to deploy changes, or even to completely roll out or delete a service? And another question I have in regards to this: how does it impact the immutability of your infrastructure and clusters? Sure. Yeah, I'll talk about how we can move services between clusters a little bit. I mentioned earlier that in each application or service repository there's the _infra folder, and inside of the _infra folder there's a kube-gen configuration file. The structure of the kube-gen file is that there are multiple environments for the app, so you might have production, development, and staging environments. In each of those environments there's a single YAML field called context, and context is basically the cluster that this environment gets deployed to. When our kube-gen CLI reads in the kube-gen file, it generates all the manifests, and then at deploy time that context is checked and all those manifests are sent to the correct cluster. So instead of having to specify the cluster multiple times, you just have that single field. And because the kube-gen file is just plain YAML, it makes it easy to run automated refactors across the service repositories where you change the context of environments, so you can almost automatically move services between clusters. First you deploy to the new cluster, and then gradually scale down the service in the old cluster. We can even run automated canary analysis: we run, say, one replica in the new cluster and one replica in the old cluster, both receiving production traffic, and then you can compare to make sure that the service is running normally on the new cluster before migrating the rest over.
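The automated refactor described above, flipping the context field of one environment to point at a new cluster, could be sketched roughly like this. The file layout and field names are illustrative assumptions, not Airbnb's actual kube-gen schema.

```python
import re

# Hypothetical sketch of an automated cross-repo refactor: flip the `context`
# field of one environment in a kube-gen-style YAML file from the old cluster
# to a new one. The layout and field names are illustrative, not Airbnb's
# actual schema.
KUBE_GEN = """\
environments:
  production:
    context: prod-cluster-a
    replicas: 3
  staging:
    context: staging-cluster-a
    replicas: 1
"""

def move_environment(text: str, env: str, new_context: str) -> str:
    # Match the `context:` line that immediately follows the environment key.
    pattern = rf"(^  {env}:\n    context: )\S+"
    return re.sub(pattern, rf"\g<1>{new_context}", text, flags=re.M)

updated = move_environment(KUBE_GEN, "production", "prod-cluster-b")
print("context: prod-cluster-b" in updated)  # True
```

Because the change is a plain text edit, the same script can be run across every service repository, which is what makes cluster migrations scriptable rather than manual.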
And so, in practice, we're not moving services, we're not rotating through clusters every single day, because we're running tens or hundreds of services in a given cluster. But what this lets us do is, when we're doing more challenging migrations, like when we're switching our CNI plugin, we can create a new cluster with the new CNI plugin, move our services over one by one, and then gradually spin down the old cluster. It makes these kinds of transitions of cluster setup go from something that's really risky with a large blast radius to something that we can do gradually. Now, this sounds really cool, because migrating between clusters, especially in what seems like an automated fashion, is challenging, but you already have the processes built in-house to allow this kind of operation. Another question I have, which ties into how the services are delivered, is about developer experience. Now, in one of the talks I've seen it was mentioned that Airbnb migrated from managing hundreds of services to hosting thousands of them in less than three years. I would like to ask how this impacted the developer experience, and maybe whether you have new methodologies to troubleshoot, maintain, and debug an application when something is going wrong. Sure. Yeah, the journey of building out that SOA was a pretty massive one over multiple years. I think back in 2015 is when we really started that effort in earnest, and as I mentioned, we were exploring Kubernetes around that time as well. Basically, at that time we knew that there would be lots of services that needed to be created and maintained, and no one knew exactly how many, but our goal initially was to allow service owners to create a production-ready service in just one hour.
And this required a lot of the consolidation effort that I talked about at the beginning, which is moving from having multiple tools and repositories that service owners had to edit and deploy into one single place. That's why we call it OneTouch: it's just one place to put your code and configuration. And you mentioned a second part of your question, right? Could you remind me about that? So, this is pretty much the developer experience that you've covered so far. I'm curious if, for example, as an engineer I'm using OneTouch to deploy my application, what is the process if I would like to troubleshoot it, or maybe verify that it's running in the right clusters with the right number of pods, or get the logs? What are those troubleshooting procedures internally? Got it. Yeah, for that we have a command line tool called k, which is similar to a lot of people's aliases for kubectl, but in this case it's our own CLI that wraps a lot of the common workflows you mentioned: exec-ing into a pod for interactive debugging, or removing a pod from service discovery for that kind of interactive debugging while making sure it's deleted later, or checking the logs. The reason why we have this is that we want a more application-centric focus in the CLI, so some of the arguments that you can provide are the app and the environment, and then we derive the namespace from that by concatenating the app and the environment together. That's really cool. I'd definitely like to try it out. Do you by any chance have a version which is open source, so the listeners can verify it for themselves, or do they just have to take inspiration from what you've been telling us so far? We've been talking about that for a little while and considering it. Do you have any insight into the current thoughts on that?
For OneTouch or the k tool? Yeah, we haven't really pushed forward on that, because I think there's a lot of duplication between k and kubectl, and a lot of the functionality it provides is very specific to Airbnb, like how we manage namespaces and how we manage applications. So there's a little bit of proprietary stuff there. I've spoken in the past about making elements of OneTouch open source, although I suspect the industry has moved forward since, because I think at the time it was a relatively novel way to manage application resources. So we'd love to, but there's not a huge amount of progress yet. And there are other ways that kubectl now allows you to integrate, like with plugins, which are another good way of adding your own workflows around it. Initially this started as kind of scripting around kubectl, but since it's now a full-fledged CLI we've got plenty of other workflows; for example, we also use k to allow developers to get access to their Kubernetes clusters, to get their authentication credentials. I wanted to mention, as before, that if there are still any intentions to open source these tools, I think the community is going to look forward to them, because there are so many organizations that are still at the beginning of their journey to cloud native, and this is definitely going to be useful to some sectors. You've mentioned kubectl plugins; I think again this is a great way to personalize and tailor the way you consume kubectl commands and the way you interact with your cluster. So it's definitely put to great use at Airbnb, and it's great to hear about this one.
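The app-centric wrapper idea Steven described, where the user names an app and an environment and the tool derives the namespace by concatenating them, could be sketched as follows. The flag layout and the concatenation rule are assumptions for illustration, not the actual k CLI.

```python
# Hypothetical sketch of an app-centric kubectl wrapper like the `k` CLI
# described above: the user supplies an app and an environment, and the tool
# derives the Kubernetes namespace by concatenating them. The derivation rule
# and flags are illustrative, not Airbnb's actual tool.
def build_kubectl_args(app: str, env: str, *command: str) -> list[str]:
    namespace = f"{app}-{env}"  # e.g. listing-service + production
    return ["kubectl", "-n", namespace, *command]

args = build_kubectl_args("listing-service", "production",
                          "logs", "-l", "app=listing-service")
print(args)
# ['kubectl', '-n', 'listing-service-production', 'logs', '-l', 'app=listing-service']

# A real wrapper would then hand this off to something like subprocess.run(args).
```

The design choice here is that users think in terms of their application, and the tool fills in the Kubernetes-level details (namespace, context) consistently, so nobody targets the wrong namespace by hand.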
Another point I wanted to make about the developer experience, which is maybe a bit less technical: being developer-centric is quite important, and it definitely empowers the engineering team to deploy their services. But I would like to ask whether having a good developer experience maybe impacted the culture internally at Airbnb, maybe some of the practices, or maybe attracting top talent. Do you think it impacted these areas in any way? Like I mentioned, one of the benefits of separating out the management of infrastructure and product teams is the ability to specialize. If you're on a product team, or a service owner, you can now more confidently deploy your changes to production, because we have a more modern and managed infra, and there's less time required for firefighting and distractions on reliability issues. That allows our product teams to achieve a better velocity. I think the reliability story is really interesting too, because one of the reasons organizations move to a microservices-based architecture is that it really helps your engineering team scale. Airbnb has, I think at this point, thousands of engineers, and the monolithic application I think wasn't working well at the time, just because of the velocity of work that people are doing. So we moved to the service-oriented architecture while maintaining a common standard for how we want these applications to be operationalized: how we do alerting, how we manage services in production, how we autoscale things. One of the big benefits of OneTouch for Airbnb was that we were able to enforce these common standards across all applications, and we have one central place to introduce changes. Things like upgrading dependencies are in some ways harder, because there are many applications to upgrade at a time.
But we also have a central way to make sure all the services are moving forward in lockstep and meeting our latest standards. That's been pretty powerful, I think, because almost all of our services run on Kubernetes, and I think almost all of them are now autoscaled or right-sized in some way, which is really helpful when it comes to adapting to changing load. You know, the first of January is a busy day for Airbnb, because people like booking vacations, and the tooling just takes care of the clusters for us; we don't need to worry about it. I think that's a big part of our maturity as an organization. Oh, that's really cool. Thank you for giving such a thorough introduction to your clusters and the deployment process to the platform. Another question I have is more about future challenges. Do you feel like at this point there are any challenges that you're going to face in building your clusters, maintaining your clusters, or deploying your applications? Or maybe there are some new technologies that you'd like to adopt and that are on your radar at the moment? Yeah, some of our current challenges are around how we deploy services to multiple clusters, and rethinking some of our fault domains. Right now, all of our clusters run across multiple availability zones, and we deploy one single service environment to one cluster. One of the issues is that the cluster right now is still the fault domain for a particular service environment. So we want to rethink that a little bit, because one of the problems we've seen with running clusters across multiple zones is balancing replicas evenly across the zones. With things like topology spread constraints, that's become a little bit easier.
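The balancing problem that topology spread constraints address can be sketched in miniature: given where a service's replicas landed, compute the skew between the most and least loaded zones. Kubernetes' topologySpreadConstraints with maxSkew: 1 asks the scheduler to keep this number at or below 1 for a chosen topology key, such as the zone label. The zone names below are just examples.

```python
from collections import Counter

# A minimal sketch of replica skew across availability zones. Kubernetes'
# topologySpreadConstraints with maxSkew: 1 asks the scheduler to keep the
# difference between the most and least loaded zones at or below 1.
def zone_skew(placements: list[str]) -> int:
    counts = Counter(placements)
    return max(counts.values()) - min(counts.values())

balanced = ["us-east-1a", "us-east-1b", "us-east-1c",
            "us-east-1a", "us-east-1b", "us-east-1c"]
uneven = ["us-east-1a", "us-east-1a", "us-east-1a", "us-east-1b"]
print(zone_skew(balanced))  # 0
print(zone_skew(uneven))    # 2
```

(This simplified version only counts zones that hold at least one replica; the real scheduler also considers eligible zones with zero pods.)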
But we still want to maintain really even capacity, in case our underlying cloud provider loses some capacity in a given availability zone, and also to avoid traffic going between those zones. So we're thinking about how we can restructure these clusters. Could we maybe run one cluster in a single availability zone, and then deploy a service environment to multiple clusters? But in order to do that, we need to have a strong idea of our distribution story: how are we going to abstract away the underlying set of clusters from users, and allow them to maybe just specify some constraints, like the set of hardware they would like to run on, without having to worry about whether it's going to prod A or prod B or prod C? Again, that sounds really exciting, and hopefully you're going to share some of these thoughts once you've actually implemented them, during different talks and sessions at KubeCon. The next set of questions that I have are, of course, around your KubeCon and cloud native participation, because Airbnb has in the past given many talks and even keynotes during KubeCon. The keynote that I was talking about refers to OneTouch, which we've covered quite heavily in the first section of the stream. However, there is one talk that you delivered, Steven, which is "Did Kubernetes Make My p95s Worse?". Could you share Airbnb's journey on performance gains and losses during its mass migration to Kubernetes? Sure. Yeah, so that's one of the challenges that came with making sure that all of our existing services adopted Kubernetes as well as our new ones: making sure that developers had a good sense of whether their application was running faster or slower when migrating. We also encountered lots of interesting performance regressions, which we shared in that talk.
Some of the gains that we saw in terms of performance were around efficiency of resource usage. We had more uniform provisioning because of the bin packing we were able to do. We're able to enforce that a certain percentage of resources is actually being used on the nodes, and then autoscale up the number of nodes in our cluster when we hit, say, 85% utilization, where I define utilization as all the CPU and memory requests of the pods compared to the CPU and memory offered by all the nodes. You can compare this to previously, when service teams were in charge of their own hosts: not all of the services were autoscaled, and that led to some pretty inconsistent provisioning. Some services were massively overprovisioned, and others, when traffic increased a small amount, would have to rapidly scale up. So that's one of the things that we got: central control over how we're composing our fleet and how we're bin packing. We can also upgrade all of our hardware to the latest generation, which again was not possible with each individual team managing their own hosts, so that's an easy win. Then for some of the losses: because we're running multi-tenant, we've got interference, so we had to do a lot of research with regards to CPU limits and latencies. Is it better to set a CPU limit equal to the request, set a really high CPU limit, or no limit at all? Currently we're not recommending that users set CPU limits, but we still want to alert on utilization relative to the requests. Now, this is a very short overview of what Steven shared; we will try to point to the actual talk as well, which was given at KubeCon. Another session that I want to mention: scaling has been mentioned quite heavily today throughout the entire discussion.
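The utilization signal Steven described, the sum of pod requests versus what the nodes offer, with a scale-up decision around 85%, can be sketched roughly as below. The threshold and the exact definition come from the conversation and are used illustratively; Airbnb's actual autoscaling logic is not public here.

```python
# A rough sketch of request-based cluster utilization: the sum of pod CPU
# requests divided by the CPU the nodes offer, with a scale-up decision at an
# 85% threshold. Numbers are illustrative, not Airbnb's exact algorithm.
def utilization(pod_requests: list[float], node_allocatable: list[float]) -> float:
    return sum(pod_requests) / sum(node_allocatable)

def should_add_nodes(pod_requests, node_allocatable, threshold=0.85) -> bool:
    return utilization(pod_requests, node_allocatable) >= threshold

# Three 4-CPU nodes, pods requesting 10.5 CPUs in total: 87.5% utilized.
print(should_add_nodes([2.0, 3.5, 2.5, 2.5], [4.0, 4.0, 4.0]))  # True
```

Note this is utilization of requests, not of actual usage: it measures how tightly the scheduler has packed the cluster, which is the figure that drives adding or removing nodes.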
One of the talks that was given was "Scaling Kubernetes to Thousands of Nodes Across Multiple Clusters, Calmly". Now, this is I think quite an important characteristic of managing clusters, because when you scale, when you increase the amount of infrastructure, calm is usually not something you would associate with the situation. The talk describes how Airbnb scaled from 600 nodes to 5,000 nodes and tens of clusters. Maybe you could briefly share how you completed this migration, some of the challenges that were faced along the way, and the specific approaches that you could recommend to the listeners, pretty much everything in this context. So, yeah, a lot of this talk was motivated by our journey from running one single production cluster, which was our initial attempt, to breaking that into dozens of clusters. We learned a lot about cluster scalability with that first approach that I mentioned: we had to understand etcd scalability, events, scheduler algorithm efficiency, some kube-dns issues, and lots more. We've shared some of those stories individually in other talks, like "Did Kubernetes Make My p95s Worse?", as well as a series of talks that we've got called "Ways to Blow Up Your Kubernetes Cluster". Pretty early on, when we were migrating our services over, we realized that we would need multiple production clusters, but because of that initial experience we had good guidelines around how big to make each production cluster. Currently we run each cluster capped at around 1,000 nodes, we have other limits on things like pod update rates and endpoints per service, and we follow the guidelines from SIG Scalability pretty closely. And besides the scalability of our clusters,
we also touched in this talk on provisioning automation and speed. Our first clusters were set up in the style of Kubernetes The Hard Way, so very hand-rolled. For creating many clusters we were looking to raise the level of abstraction, so we looked through some of the existing products at the time for bootstrapping clusters, like kops and kubeadm. Ultimately we decided to create APIs that were inspired by these projects, and then we wrote scripts underneath that generated our own configuration. That allowed us to integrate with our existing VM management. Some of the key ideas from that talk: one, you want to have an API to describe your cluster state, which is still evolving in the community; and two, you want to group similar clusters into a cluster type, where a cluster type just means common configuration shared across multiple clusters. In the future we're hoping to draw some analogies to, for example, how ReplicaSets specify the number of replicas and Deployments manage rollouts, so you can imagine evolving that API into some sort of custom resource that allows for a smooth rollout of new changes across multiple clusters. It sounds like operators are still a driving force when it comes to maybe day two or even day three Kubernetes. Again, if you're going to build something around this, the talks that have already been delivered on this topic have great content, and I definitely would like to encourage everyone to watch those. And if there's further work on this, it would definitely be great to hear about it. Now, going back to KubeCon + CloudNativeCon North America this year, maybe Sunil can have some input here as well, in addition to Steven. What are you looking forward to exploring most during KubeCon, which is going to happen in three months? I can share one thing.
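The cluster type idea from that talk could be sketched as a pair of declarative records: a ClusterType holding configuration shared by similar clusters, and each Cluster adding only what is unique to it, much as a ReplicaSet stamps out identical pods. All field names and values below are illustrative assumptions, not Airbnb's actual API.

```python
from dataclasses import dataclass

# Hypothetical sketch of a declarative cluster API: a ClusterType holds
# configuration shared by similar clusters; each Cluster only adds what is
# unique to it. Field names and versions are illustrative, not Airbnb's API.
@dataclass
class ClusterType:
    name: str
    kubernetes_version: str
    cni_plugin: str
    node_cap: int = 1000

@dataclass
class Cluster:
    name: str
    type: ClusterType
    availability_zone: str

prod_type = ClusterType("production", kubernetes_version="1.21",
                        cni_plugin="example-cni")
clusters = [Cluster(f"prod-{z}", prod_type, z)
            for z in ("us-east-1a", "us-east-1b")]

# A config change touches the type once; every cluster of that type then
# picks it up on its next reconcile.
prod_type.kubernetes_version = "1.22"
print({c.name: c.type.kubernetes_version for c in clusters})
# {'prod-us-east-1a': '1.22', 'prod-us-east-1b': '1.22'}
```

Promoting that structure to a Kubernetes custom resource, with a controller reconciling real clusters against it, is the operator-style evolution the speakers hint at.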
So, we're at this point now where we're running almost all stateless services on Kubernetes, and the infrastructure team at Airbnb is getting really excited about running stateful things on Kubernetes, especially as we start thinking about how we manage our infrastructure at a large scale: different regions, different availability zones, and so on. We're really interested in how we can onboard stateful services: things like online and offline databases, and other distributed systems like Kafka. Some people have started doing this in the industry, and now it feels like other companies are starting to reach the stage where they're running these things in production. So one of the things we're really interested in is learning more about how people are running stateful services on Kubernetes, and how we do that in a reasonable way, in a safe way, at Airbnb's scale. I can also mention that, from the service mesh side of things, we're interested in some of the recent efforts on the part of Kubernetes: the native Kubernetes resources that are in the works to enable easier integration with service meshes, multi-cluster services for example. Initial service mesh efforts kind of piggybacked on existing resources like Services and Endpoints, but hopefully the two efforts can meet each other a little bit, and I'm looking forward to what the community has worked towards. I want to hear more about this one. I think there are definitely new co-located days during KubeCon, and some of them are going to be focused on managing data, so maybe some of those can be quite insightful into managing stateful applications. Another cool topic that Steven mentioned, of course, is how you can use a service mesh across multiple clusters and actually make sure that the services in different clusters communicate between themselves securely.
Now, maybe taking a step away from Airbnb, taking the Airbnb hat off and putting the community hat on: what kind of predictions do you have in regards to emerging themes and technologies within the wider ecosystem? This can be completely unrelated to your current work at Airbnb, maybe something you're excited to know more about or to see developing in the community. Yeah, one of the things I'm personally excited about is watching identity and authorization projects grow and gain more adoption. For example, Open Policy Agent recently graduated from the CNCF, and it's seeing more and more adoption. I'm excited to see how that's going to be used not just as an admission controller or for service-to-service authorization, but as a component that people use for general policy enforcement across all their infrastructure. On the identity side of things, I'm looking forward to seeing how SPIFFE and SPIRE can see widening adoption across the stack, not just for service-to-service authentication, but also maybe for user authentication and access to services or infrastructure, so we could use some of these projects and extend them to allow SSH permissions or other things like that. Sunil, do you maybe have your own insights into emerging themes within the cloud native ecosystem? Yeah, I'm a little out of touch with the community, but I know there are a set of startups that are looking at this idea of runbooks as code, which I find kind of interesting: using tooling to augment your on-call engineers' ability to investigate issues with clusters. That seems really powerful to me, because at a company like Airbnb we have a large engineering team, and it's not really sustainable for our team to be involved in every incident to do with a service having issues, so anything we can do to programmatically enrich the data that goes to users is really helpful.
So that's the direction I'm really interested in. There are lots of challenges there, because in order to do that well you have to integrate a lot of different providers and systems, and permissions and the way you host this data are really kind of interesting and confusing, but I think that's something that has a lot of potential to make on-call a lot easier for people managing lots of clusters. I'm definitely curious to see these areas growing within the ecosystem as well. The last set of questions I have is in regards to your experience as an end user. Now, Airbnb became a CNCF end user member quite recently; you actually joined a couple of weeks ago. I know it hasn't been too long, but still, I'd like to ask about your experience of being a CNCF end user, and your experience with communicating with or reaching out to the community, adopting tooling, and so forth. So I think in general the community has been really accommodating and welcoming. Project maintainers are always ready to discuss our requirements and any issues that we bring up, and if they feel like, for example, a request we raise is better fulfilled outside of the project, or it's not likely to be prioritized in the near future, they can communicate that as well, and then we can work together to either find some extension mechanism or know that we're going to build our in-house solution for the time being. Sunil, do you have any thoughts on the experience as an end user? Yeah, it's been pretty great so far. I'm really excited to see how much more we can do in the future. It's been nice having the ability to communicate directly with all the members of the community, and even though we were open to talking to these companies before, it's now kind of explicit: hey, we're open to sharing, which is great.
I've already had a couple of LinkedIn conversations with people in the community, like, oh hey, I see Airbnb is now a member of this community, we'd love to talk more about Kubernetes. So that's been great. Awesome. One of my last questions, actually: I know that Airbnb has been quite active in outreach to the community, in providing a lot of talks around how you set up infrastructure, deploy your applications, and so forth. My question is, how do you think end user organizations can contribute and give back to the ecosystem? Do you have any best practices or maybe recommendations around this topic? One of the great things is that everyone, not just the core maintainers, can file bug reports and patches, and even do feature development, and those are things we've been doing regularly. We also try to attend working groups and special interest groups to read design documents while they're in progress and mention our use cases and requirements, so that we can motivate specific solutions. We can also discuss those extension points I mentioned, which allow projects to be decoupled from business-specific logic or policy. That allows for wider adoption of these projects and features, and it also lets the maintainers know which features different companies are using, so they have a better idea of what to prioritize. Awesome. Sunil, any last thoughts on how end users can contribute back to the community? Yeah, I think Steven covered it pretty well. The big thing for us is really pushing code upstream as much as possible, because we do run into some interesting edge cases with the projects we use, just by the nature of our scale and setup. I think it's helpful for us, and also for everyone else, to get more eyes on our code and to really upstream as much of this stuff as possible.
It just reduces the maintenance overhead for us, but it also really gives back to the community, so that's something we're definitely trying to do more of in the next few years. Awesome. Well, I'm looking forward to all of your contributions, be it in code, be it in talks, be it in outreach to the community. I think these are all great ways for everyone to reach out, and I think Airbnb is doing a great job of all of those so far. Now, these are pretty much all of my questions for both of you today. Thank you to everyone who joined and listened to this stream from the CNCF End User Lounge. It was great to have Steven Chan and Sunil Shah from Airbnb talking about how they manage hundreds and thousands of services on dozens of clusters. Just as a reminder, we bring you the latest cloud native end user stories every fourth Thursday of the month at 9am PT. One last thing I'd like to mention: don't forget to join us for KubeCon + CloudNativeCon North America, which is going to be virtual, actually hybrid, from October 12th to 15th. If you would like to showcase your usage of cloud native tools as an end user, you can join the end user community; you can find more details on cncf.io forward slash end user. Thank you for joining us today, and see you next time.