Thanks for coming to my talk. Good afternoon. My name is Tim Leong. I work for Comcast, and today I'm talking to you about how we overcame the challenges of 31,000 AIs (application instances), which is now actually 40,000 AIs. So for anyone in the room who's kind of tired of hearing about Comcast and all of our victories, you're in the right room, because I'm going to tell you how we screwed it up. No, that's not true. But for real, we've had some missteps along the way over the five years that we've been running this platform, and some of the decisions we made culminated in a series of unfortunate events that we experienced last year. So I want to take some time to talk you through what we went through, some of the mistakes that we made, and how we came together as a team, not just the platform team, but a full team of platform engineers and developers, to make the platform a success for everybody, move past these challenges, and set ourselves up to be much more successful in the future.

First, a little bit about us, the APA team: Application Platform Acceleration. Our mission is to simplify the lives of developers and help them deploy software faster and more easily, and we do that through our platforms. Obviously we support Cloud Foundry, which is a huge benefit to our developers. We also have our Apex platform, a Kubernetes-based platform for other workloads, which you might have heard about right across the hall. We have a Concourse deployment for folks who need CI/CD. And then we have our DevX team, which is responsible for direct customer outreach and consulting on cloud-native best practices.

So that's us today. But if you take us back to 2014, we were just starting in this world. We were just getting started with Cloud Foundry and deploying our first installation; I think it was version 1.2 or something like that. It was right around the time the ice bucket challenge went viral online, and we were going through sort of the same process: this new cloud-native world felt a little bit like an ice bucket being dumped on our heads. And it felt that way for a lot of our developers, too. They weren't used to this new platform. It was odd to them. They didn't really understand how it worked or what the architecture looked like, and it was very strange to them that they could just push an application and get it to run. So our team was really focused on making sure the platform was a success, and that developers could see the benefit we all knew it provided.

When we started, our method was to focus on growth and adoption. We wanted to make sure people onboarded onto the platform, and we would do whatever we could to make that happen. We focused a lot on marketing and training; we actually printed posters and put them all around the Comcast Center across the street. They're probably still there. We tried to limit roadblocks and restrictions as much as possible, so we didn't ask too many questions about what you needed that quota for. We were just happy you were using the platform, and we tried to help you do it. At the time, a lot of development teams were trying to get into DevOps and Agile, and we used the platform as a vehicle to promote those things: you can have a large part of this automatically if you just come onto our platform.
And then we basically built capacity as fast as we could to keep up with demand. We just kept rolling in more and more servers to stay ahead of that curve. And it worked really well for us. We had 300% growth year over year for about three or four years, as you can see in the comparison between 2014 and 2017 in application AIs and developers. We were able to stay ahead of demand and deploy infrastructure just in time for the next capacity or resource utilization bump.

So I'm sure you can guess what happens next. 2018 comes. Everybody is really good with the platform now. They can deploy microservices in days; it doesn't take long for really resource-intensive applications to get onboarded. And we were still operating under that same model: we're really happy for you to use the platform, and we're just going to deploy infrastructure to support it. 2018 felt like a year-long Black Friday sale, people grabbing as much capacity as they could, and us trying to keep up. And we saw incidents like these. This isn't a real performance graph, but on some of our largest clusters we would see a 40% bump in resource utilization within two weeks, something we were not at all prepared for. Then, luckily, infrastructure would come in afterwards and bring utilization back down, and two weeks later it would bump up again.

So that was a problem, and it had several impacts. We had app resiliency failures: once utilization got to a certain point, health checks started timing out and applications started crashing. There was a lot of reduced confidence in the platform; people started to wonder if this was the place they should go. And we were 100% focused on operational issues, so we weren't able to focus on the improvements that would move the platform into the future. That was our year.

So what did we learn from all of this? The first thing we learned is that forecasting is hard, really hard, when you're living in a dynamic environment like ours. You have constantly changing business needs; the business is always bringing in new initiatives, and they aren't necessarily planned far enough ahead for you to purchase and deploy hardware to support them. Some of these workloads have really unpredictable CPU spikes. Interesting story: we used to model all of our capacity off of memory demand, because a lot of our applications were really memory bound. At some point that shifted, and we saw much heavier CPU spikes relative to memory, so we had to look at things differently. That was a challenge, because CPU doesn't really have a quota system in Cloud Foundry, so it's hard to control those types of workloads; we had to watch for them ourselves, something like the sketch below. And then there's just-in-time infrastructure: you basically try to buy the infrastructure you'll need six months from now, and when that six months comes, hopefully it's there just in time for that next capacity bump.

The thing about this lesson is there's not really a lesson here. Forecasting is hard if you work in a business where these types of things exist, and these types of things will probably always exist.
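Since the platform won't cap CPU for you, about the best you can do is catch trouble early. Here's a minimal sketch of the kind of check we mean; the thresholds and sample data are illustrative only, and in practice the readings would come from your own metrics pipeline (firehose, Prometheus, whatever you run).

```python
# Sketch: flag Diego cells whose CPU utilization is spiking toward a hard limit.
# The dict of samples is a hypothetical stand-in for whatever your metrics
# pipeline actually provides.

HARD_LIMIT = 125.0    # % CPU where apps start failing health checks (our number)
WARN_FRACTION = 0.80  # alert at 80% of the hard limit
SPIKE_DELTA = 40.0    # the two-week, 40-point jumps that kept burning us

def cells_at_risk(samples: dict[str, list[float]]) -> list[str]:
    """samples maps cell name -> recent CPU% readings, oldest first."""
    risky = []
    for cell, readings in samples.items():
        current = readings[-1]
        spike = current - readings[0]  # growth over the sampled window
        if current >= HARD_LIMIT * WARN_FRACTION or spike >= SPIKE_DELTA:
            risky.append(cell)
    return risky

if __name__ == "__main__":
    fake = {
        "cell-01": [55.0, 60.0, 72.0, 98.0],    # 43-point spike: flagged
        "cell-02": [48.0, 50.0, 49.0, 51.0],    # steady: fine
        "cell-03": [90.0, 95.0, 101.0, 104.0],  # over 80% of the limit: flagged
    }
    print(cells_at_risk(fake))  # ['cell-01', 'cell-03']
```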
So if you're not in a model where you can do chargeback, or you're in a dynamic environment where workloads are always going to be changing, you might not be able to fix any of those three things. What we had to do was design around it: figure out a way to know this is happening, know we're going to see it again in the future, and design and run the platform so we could be prepared for it and respond to it in a better way.

That brings us to lesson two. Our design choice in the beginning was that whenever we had capacity needs, we would just keep adding infrastructure to a relatively small number of sites. I think at the time we had maybe five or six main sites where most of our workloads were running. And as these grew and grew, we realized what we were doing was basically creating another monolith. You have a large stack of microservices, but they're all codependent, and they're all very dependent on a single set of infrastructure underneath. So if you have any infrastructure problems underneath that cause failures, all of those containers are going to see impact, and all of those containers are going to have to move somewhere else. Not only is moving all of those containers a seriously expensive task; finding the capacity to support them on one other site is going to be a big problem.

So we decomposed it. We said: we need to support many smaller sites. We need to avoid the microservice monolith by distributing applications across many more sites, so that if you're having problems with one site, it reduces the blast radius. It's a lot easier for five sites to absorb the capacity of one than for one site to absorb 50% of the capacity from another site. So smaller sites were definitely a big thing. An ancillary benefit is that it moves us toward multi-cloud: if applications are already distributed across five or six smaller sites, it's a lot easier for them to go into an Amazon and be prepared for situations where you might need public cloud. That was a big lesson for us.

Lesson three is that we needed to find meaningful KPIs. What we mean by that is: when you get started with the platform, you start looking at things like CPU utilization, how many AIs you have, your overall memory usage. But what you need to find for your platform is the hard limits you can operate under. For instance, when we look at our CPU utilization, we know our hard limit is 125%. If we start seeing CPU utilization go over 125%, we know applications are going to start seeing impact, they're going to start failing, and we're going to start getting calls. Not that we're trying to size to 125%, but in the event you need to fail 100% of your applications over to one site, you need to make sure you're staying well below that hard limit, and you need to know what that limit is. Another good metric we found was the number of AIs per core. When you're thinking about building more capacity and more infrastructure, it helps to know how many AIs you're going to be able to support with all the cores you just added.
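To make that arithmetic concrete, here's a minimal sketch. The two-AIs-per-core ratio and the 125% limit are our numbers from trial and error (they show up in the rack example that follows); yours will be different.

```python
# Sketch of the capacity arithmetic behind these KPIs. The ratio and limit
# below are our empirically derived numbers; your workload mix will differ.

AIS_PER_CORE = 2        # how many AIs a core supports on our workload mix
CPU_HARD_LIMIT = 125.0  # % utilization where apps start seeing impact

def max_ais(cores: int) -> int:
    """How many AIs a new block of cores can take before we stop placing more."""
    return cores * AIS_PER_CORE

def survives_failover(site_util_pct: float, incoming_util_pct: float) -> bool:
    """Can a site absorb a failed site's load and stay under the hard limit?"""
    return site_util_pct + incoming_util_pct < CPU_HARD_LIMIT

print(max_ais(600))                   # 1200, the rack example below
print(survives_failover(70.0, 60.0))  # False: 130% blows past the limit
print(survives_failover(50.0, 40.0))  # True: 90% leaves headroom
```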
So if you have a big rack of servers and it represents 600 cores of CPU, we know that's going to support 1,200 additional AIs, and we will not go over that limit. That's a good metric to go by. These things come from trial and error; there's no real magic way to figure this out, because all workloads are different and all CPU profiles are different. So it's really important to work with your application development teams, understand those profiles, and try to get as close to an accurate number as possible.

And when you're approaching or going over a limit, you want to know why. It's not inherent in the platform to understand where resource usage is coming from on an org or application basis, so it's really important to have those metrics and to be able to see them on a historical basis, because application and performance profiles are going to change, and you're going to want to talk to those people and understand what they're doing. You also want to know whether applications are distributed across sites: we found a lot of applications that were really banging against one environment and weren't actually distributed across multiple environments. Being able to report on that is also very important.

Lesson number four: remove inefficiencies. One of the things we learned was that we needed to focus on the core functionality of the system. Cloud Foundry is an application runtime; it's there to support application developers pushing and running their applications. Over the past few years, a lot of developers came to us, and I'm sure some people who run Cloud Foundry have been in the same boat, saying: I'm losing my logs, I can't have log loss, I need these back, you need to scale up your Loggregator system. At one point we were going through that same chase, and we found out that in order for our Loggregator system to support all of those logs, we would have to deploy somewhere on the order of 64 VMs. 64 VMs is more than a quarter of the number of VMs supporting Diego. We came to the conclusion that there had to be a better way, and application developers do have options for logs. This is actually the current log loss we see on some of our busier sites, and that's OK. Because if you really need your logs and they're really, really important to you, then maybe you shouldn't print to standard out and rely on the Loggregator system; maybe you should log directly with some sort of logging framework within your application, something like the sketch below. So it's really important for us to remove inefficiencies and reduce the amount of capacity being consumed by these ancillary services the system provides.

This one might seem obvious, but separate dev and prod. It's something we've done since the beginning, and at the start we were trying to figure out whether we should, or whether we should just let everybody live in the same site. This is the cloud, why not? But what we found is that workload spikes like these, where you're running about 400 requests per second and it suddenly jumps to 3,200 requests per second, are going to impact your production workloads. You don't want that kind of workload in your production environment, so it really helps to separate them out.
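For what it's worth, here's a minimal sketch of what we mean by logging directly from the application rather than through Loggregator. The logger name and the syslog endpoint are hypothetical stand-ins for whatever log platform your team actually runs.

```python
# Sketch: an app shipping its own logs instead of relying on stdout/Loggregator.
# The host, port, and app name are placeholders, not real endpoints.
import logging
from logging.handlers import SysLogHandler

logger = logging.getLogger("checkout-service")  # hypothetical app name
logger.setLevel(logging.INFO)

# Ship logs straight to a log platform over the network; if Loggregator drops
# envelopes under load, these still arrive.
handler = SysLogHandler(address=("logs.example.internal", 514))
handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
logger.addHandler(handler)

logger.info("order %s processed", "12345")
```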
That separation also has the benefit that teams can be a little less careful with their log verbosity in the non-prod environments than in prod.

Reducing AIs was a big benefit for us. We actually had the benefit, I guess you could call it a benefit, of rebuilding entire environments at the end of last year. Before the rebuild, one environment was running 5,000 AIs; after the rebuild, that dropped by 2,000, and it's pretty much stayed there since. Application developers will get in there, push stuff, let it run, and it'll just stay that way forever, and as platform owners it's really difficult for us to tell whether those are legitimate applications or not. And then, obviously, you want to report all of this back. All of these metrics are available to all of our developers so they can see what's going on, and we try to do a really good job of working with them to understand where the inefficiencies are and what they can do to help.

Let's see, last one. Lesson five, and this one is probably the most important: trust and partnership. When we started with Cloud Foundry, we really saw the developers who were onboarding as customers. We were operating like a little startup: anybody who came onto the platform, we had to make sure they were happy so they would stay on the platform and make it a success. But as we moved on, we saw that partnerships were much more valuable, not only for us but for the application developers as well: people who will work with you to help move the platform forward. This has been really important for us, because when the platform was having issues, our app devs didn't just lay it on us. They worked with us. They asked what they could do to help, how they could spread things out, how they could make this more secure, because they really do love the platform. Working together is mutually beneficial for everybody.

It helps to be extremely transparent in these kinds of situations: be upfront about the problems you're facing, let everybody know exactly what's going on, and just be really honest. Building that trust with your partners is key. Platform visibility: we make sure all of our metrics are consumable by everybody, and we do see app devs pointing out platform metrics and KPIs we might have overlooked. They come back to us and say, hey, have you looked at this? Looks like your AI counts are down by 100; maybe there's some kind of issue going on. That's been really helpful, and it extends that whole partnership relationship. And we try to promote active communities as much as possible, keeping the dialogue open so everybody can communicate about these issues. Our community, I think you heard about it before, has about 1,500 Slack users, all communicating, all helping each other out. Several people have answered questions in our absence while we're at the summit this week, so having that active community is really, really great.

And that's pretty much it. So just to summarize: make sure your applications are distributed.
Avoid that monolith and deploy your applications to as many sites as possible so you can reduce that blast radius. Find the KPIs that matter to you: understand your hard limits, understand where your platform is going to start to fall over, and also who is causing that to happen. Promote the core functionality of the platform and reduce as much waste as possible; again, our attitude is that logging isn't all that important when it comes from the platform, and there are other ways around it. And make sure you have partnerships, not customers. Work with your app developers and build that mutually beneficial relationship so they can help you advance the platform just like you're helping yourself.

That's it for the talk. I just want to quickly mention our open source program at Comcast. We have a really great team; I think you've heard this several times, probably in several different talks, but it really is the real deal. It's cool to have this organization help us promote some of our projects into the open source community, and it's refreshing to see how committed Comcast is to it. And Comcast is a great place to work. I've been there 15 years. I love it, I love the opportunity, and I love that we're on the cutting edge of technology across so many different organizations. So if you're interested, I really recommend you talk to some of our talent acquisition people at our booth. That's it. Thank you. Any questions?

It's a lot of both. I'd say it's a lot more on our developer side, but we do have some platform projects that have been submitted: some buildpacks we've contributed, a BOSH release that's been submitted. So there are some contributions by our platform team, but the vast majority are from our app devs.

It's all at the discretion of the user, really. An app developer usually knows what's best, so if he or she feels that a project they're working on would be beneficial to the community at large, they can submit to have it open sourced, and they'd work with our open source office to figure out how that gets done. We do, yeah. Any other questions?

Well, we definitely promote Cloud Foundry as an application runtime; I think that's its chief concern. If folks are looking to run really high-scale data platforms, we try to steer them in another direction. We do have data platforms on Cloud Foundry that are there as a convenience, or for very lightweight data workloads, but on the whole there are lots of options within Comcast. Just half an hour ago, Colby and Brett talked about our CFCR platform, where teams are running Kafka, Mongo, and some other things like that. So that's the direction we'd try to steer them in.

Well, yeah, we are. Unfortunately, a lot of the tools that are inherent in the platform are a little myopic: they look at a single org or a single app. For seeing the entire forest, correlating that to the top 10 orgs, for example, and having historical metrics, we kind of had to roll our own. We collect Loggregator statistics out of the firehose, correlate that back with metadata we pull out of the API, and then build Grafana dashboards off of it. That's kind of how we roll.
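To show the shape of that roll-your-own reporting, here's a minimal sketch. The metrics feed is a hypothetical stand-in for a real firehose consumer, and the API host and token are placeholders; the /v3/apps call with its include parameter is standard Cloud Controller v3.

```python
# Sketch: correlate per-app metrics with org metadata for top-N reporting.
# app_cpu_samples stands in for what a real firehose consumer would emit;
# the Cloud Controller v3 calls are real, but the host and token are placeholders.
from collections import defaultdict

import requests

CC_API = "https://api.cf.example.internal"              # placeholder foundation API
HEADERS = {"Authorization": "bearer <token from UAA>"}  # placeholder auth

def app_to_org() -> dict[str, str]:
    """Map app GUID -> org name via /v3/apps with space and org included."""
    mapping: dict[str, str] = {}
    url = f"{CC_API}/v3/apps?include=space.organization"
    while url:
        page = requests.get(url, headers=HEADERS).json()
        orgs = {o["guid"]: o["name"] for o in page["included"]["organizations"]}
        space_org = {s["guid"]: s["relationships"]["organization"]["data"]["guid"]
                     for s in page["included"]["spaces"]}
        for app in page["resources"]:
            space_guid = app["relationships"]["space"]["data"]["guid"]
            mapping[app["guid"]] = orgs[space_org[space_guid]]
        url = (page["pagination"]["next"] or {}).get("href")  # follow pagination
    return mapping

def top_orgs_by_cpu(app_cpu_samples, lookup, n=10):
    """app_cpu_samples: iterable of (app_guid, cpu_pct) from the firehose."""
    totals = defaultdict(float)
    for app_guid, cpu_pct in app_cpu_samples:
        totals[lookup.get(app_guid, "unknown")] += cpu_pct
    # The whole-forest view the stock tooling doesn't give you; feed this to
    # Grafana (e.g. via a time-series DB) for historical dashboards.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
```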
So it can be a combination of the two. We do mostly foundations; we have separate foundations that we support. If we're in Amazon, we do have a multi-AZ architecture per site, but back in our private cloud, we just deploy separate sites, and we ask our devs to deploy to those separately.

If I deploy to site A and I'm writing to the database at site A, and site A goes away, is it the developer team that needs to make sure there's replication happening at site B? Yeah, yeah. We do put the onus on them a little bit on that one. We do have some folks on our team who can help at an advisory layer, so if they need help figuring out how to do that from an architectural standpoint, we can help them. But we can't really deploy those things or solution it out for them.

Do we own the load balancing? We do. A member of my team, Dil, who I see here, developed a GSLB-as-a-service that sits in our marketplace. This is part of the reason we promote distribution so much: we try to make it as easy as possible for people to distribute across sites. So we have a service that interacts with an F5 API and allows you to create a GSLB instance across multiple sites, as well as certificate management through a different service. That works out really well, and it makes things a lot easier for us as well.

Is that a planned open source item? It's not, I'm sorry. Our networking team, bless their hearts, allowed us to interact with their API to create these things, which is pretty much unprecedented at our company. But they're really good people; our networking team is great, really forward thinking. I apologize if yours is not. The reason it's not open source is that they actually put a service layer in between, so a lot of the API calls are proprietary; they wouldn't really be useful to anybody.

Is that it? All right, thank you.