Good morning. Excellent. Can you hear me okay? At the back? Barry? Good. That's better. All right. Good morning, everybody. Welcome to the panel session: OpenStack Operations, Resource Management and Capacity Planning at Comcast. We are about a third of the operations team and one of the engineering team from Comcast, and we're gonna tell you about our experiences running a production-grade, large-scale, nationally distributed OpenStack infrastructure, supporting a very large range of tenants with diverse workloads. So we'll just start off with some introductions. My name is Steve Muir and I'm the director of the OpenStack engineering and operations team at Comcast. First here is Sheila. Sheila, would you like to introduce yourself briefly? Sure. My name is Sheila. I've been with the Comcast cloud team for almost three years and with Comcast for five, and I'm on the ops team. Rich. Hi, I'm Rich. I've been with Comcast just over a year now, and I'm also on the OpenStack ops team. James. Hi guys, I'm James. I'm on the storage side of things. I just joined about six months ago, so I'm pretty new. All right, so we're gonna start with a few high-level questions just to kind of get the panel going. It is a Q&A format, so if anyone wants to ask a question in response to one of our sort of prepared questions, that's fine. Just raise your hand and we have a mic. We can maybe pass the mic through the room to make that a little easier. So let's go to the first question. I'm gonna go to Sheila. Just give a brief overview of the OpenStack environment we have at Comcast. Sure, so we started the Comcast private cloud in 2012, and that was in the summer. In the fall, we had our two production environments. We were running on Essex. Since then, we have over 20 data centers and environments running Havana and Icehouse, and we've got tons of internal applications running on the cloud.
Some of those applications are Xfinity Share, parts of X1, and our internal conferencing center. Thank you, Sheila. Rich, you're involved in some of the capacity planning aspects of our cloud operations. Can you talk a little bit about how we do capacity planning for the cloud? Sure. So we try to find out who our big customers within the company are going to be. Sort of our X1s, the big external-customer-facing projects within the company, and try to gauge from them what they want to do on our cloud. And then we try to make sure that there's enough room for as many of those big projects as we can fit into our cloud. And just for folks that aren't familiar, X1 is the Comcast set-top box. It's an IP control plane UI running off of a Java server application. It's the largest tenant in our cloud. And we certainly have a very significant number of other tenants as well, so it's a real challenge sharing resources amongst all those tenants. James, can you talk a little bit about how we do storage to support that workload? We do a lot of storage. When I joined, they tacked another couple of zeros onto every number I'd heard before. So for storage, we use Ceph as our backend, and we do that for object storage as well as block storage. We're also implementing a small solid-state storage offering right now using a vendor-supplied product. Okay, thank you. So one of the interesting things about running a multi-tenant cloud is obviously figuring out the right level of resource utilization to aim for in the cloud. So, Rich, could you kind of talk about what some of the key metrics are that we use for measuring utilization and setting thresholds for when we sort of deem a particular data center to be full or at capacity? Certainly. One of the biggest resources that we have to keep a very close eye on is memory usage. That's almost certainly one of the first things that gets used up very quickly in our cloud.
But besides that, we also keep track of cores being used, ephemeral disk space being used, the actual disk space on the compute nodes themselves, and IPv4 address usage. And we determine the environment is full — meaning if we add anything else to it, it might start slowing down that region — when memory usage hits 70%, and at that point we stop adding users and projects to that specific region of our cloud. Sheila, could you talk about how we handle situations where customers come and ask for resources that just aren't available? How do you politely handle that situation? Well, we look at the numbers. We pull data pretty much every single day and we keep track of the different environments, and we know exactly which environments are at what capacity pretty much all the time. And then, well, we try to determine the customer's need before pushing back on them. But if they do want too much quota, or if they're asking for an environment that's full, we definitely turn them away, or we give them alternative options, different environments. So, thank you. Rich, you talked about memory being one of the most constrained resources. James, could you talk a bit about how storage is utilized by our customers, what sort of demands are there, and how we address that and share out the storage? Sure, so we have a lot of customers using it for block storage. We provide ephemeral for VMs, but in addition to that, we use Ceph to provide, via Cinder, the volume storage for persistent storage. Then we also have customers that are using the object store. They're using the object store through the Swift API and S3. That's backed by Ceph. And what do you see as the growth plans in terms of storage? You talked about extra zeros. Mm-hmm. Oh, yeah. So, you know, I'm sure a lot of you have this problem as well. You know, the minute you get your new storage system up, it's full and you've got to add more storage.
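The "full at 70% allocated memory" rule Rich describes can be sketched in a few lines. This is an illustrative reconstruction, not Comcast's actual tooling; the hypervisor data shape and the helper name are assumptions, and the key point is that the check runs on memory *allocated* to VMs, not memory actually in use.

```python
# Hypothetical sketch of the region-fullness check described above.
# 'memory_mb_used' here means memory allocated to VMs (the metric the
# panel says they measure), not resident memory actually consumed.

MEMORY_FULL_THRESHOLD = 0.70  # stop adding users/projects past this

def region_is_full(hypervisors):
    """hypervisors: list of dicts with 'memory_mb' (total physical)
    and 'memory_mb_used' (allocated to VMs)."""
    total = sum(h["memory_mb"] for h in hypervisors)
    allocated = sum(h["memory_mb_used"] for h in hypervisors)
    return allocated / total >= MEMORY_FULL_THRESHOLD

region = [
    {"memory_mb": 256_000, "memory_mb_used": 200_000},
    {"memory_mb": 256_000, "memory_mb_used": 180_000},
]
print(region_is_full(region))  # 380000/512000 ≈ 74% -> True
```

A real version would pull these numbers from the hypervisor statistics API rather than hardcoded dicts, but the threshold logic is the same.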
So obviously it's important to talk to your customers and get estimates from them — you know, go a year out: what is your storage need going to be? In the past, we've mainly been sizing our storage for capacity. And that's shifting now, especially as people are using more object storage — they're much more interested in performance. So we're having to kind of rethink a lot of our architecture in terms of providing more on the performance side, and figure out the trade-off between, you know, performance, capacity, and cost. So I'm trying to think. We can come back to you after you think. All right, all right. Go ahead, Sheila. One thing I want to add is, when we mention customers, they're not external customers. They're internal engineering teams within Comcast that are putting their applications onto the cloud. So they're not individual customers, but some of the applications are external customer-facing. So we're doubling — how often are we doubling capacity? It's like once a year? Yeah, in the last year we doubled our capacity, and then we doubled it again — we quadrupled our capacity. Right, it seems like every two or three data centers we stand up, we're at least doubling the capacity in terms of our storage. And so that's about every six months or something. Yeah, and certainly we do see that the customers use the resources available to them very quickly. We have a pretty predictable point at which a data center will become fully utilized from when we open it — a few months later. It's really quite a short time. So it's kind of an arms race to provision more data centers before all the capacity gets used. If we're lucky, sometimes it's a couple of weeks after we open the data center and it's already full by that time. Yeah, so it's a good problem to have. One of the issues there — let's talk about idle capacity, and how do we handle that?
Rich, do you want to sort of talk about how we handle idle instances, and things that get stuck in a state where the user can't get to them but they haven't cleaned up properly? Certainly. Well, obviously the resources are a very precious commodity when everybody wants to get on the cloud. As James said, it's a good problem to have, having more demand than there is supply. And when you have that, you have to be very mindful of things that are just sitting on your cloud and not being used. And so internally, our engineering team has developed, and is still developing, a tool to monitor instances that aren't being accessed, aren't performing any service, or might have no connectivity — but for whatever reason, the project has yet to delete them. They're still holding on to those resources when they could be freed up for use. Right, because the worst thing is to allocate a huge amount of quota and then just have it sit there for a year, not doing anything. I mean, that's not doing anybody any favors. Yep, so ideally we'll reach out. We haven't wrapped a process around it yet, but ideally we will reach out to them — it'll automatically reach out to them — and then hopefully they can free up some resources from there. Yeah, we had an interesting intern project to sort of build a tool to help us do that, correct? Does anybody want to talk about that? Yeah, so it's basically the same thing that Rich and James were just discussing. It's an internal tool. It'll go in and look at idle resources, and then it will immediately reach out to the tenant members and ask them kindly to please free up some resources. And we will push it upstream once we get it going. It's in testing right now.
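The internal reclamation tool isn't public, but the flow described here (find idle instances, notify the owners) can be sketched roughly as follows. Everything in this snippet is an assumption: the record fields, the 30-day idle threshold, and the function name are all hypothetical, not the real tool.

```python
# Illustrative sketch of "flag idle instances and notify owners",
# in the spirit of the internal tool discussed above. The data shape
# and the 30-day threshold are assumptions.
from datetime import datetime, timedelta

IDLE_AFTER = timedelta(days=30)

def flag_idle(instances, now=None):
    """instances: list of dicts with 'name', 'last_activity' (datetime),
    and 'owner_email'. Returns (name, owner_email) pairs to notify."""
    now = now or datetime.utcnow()
    return [
        (i["name"], i["owner_email"])
        for i in instances
        if now - i["last_activity"] > IDLE_AFTER
    ]
```

In practice "last activity" is the hard part — it has to be inferred from console access, network traffic, or metering data — and the notification step would go through email or the support Slack channel rather than returning a list.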
And that's really key, because you will issue out your quotas and some people are going to use their quota immediately, but you'll have other people that were just asking for a test quota — they used it for a project for a couple of months and then they're done with it. But they never remember to ask you to remove the quota; they just forget, they get busy doing something else. So having a tool to reclaim that space is really going to help your TCO and help your capacity. Yeah, I think one of the interesting observations I've had working with the operations team is that — I talked about customers; as Sheila said, these are our internal development product teams. These are not customers that are paying us hard cash today. So they're not as incentivized to clean up after themselves and not pay for things they're not using. At the same time, they're also not necessarily cloud native, and we can't just run a chaos monkey to go around and randomly terminate things and sort of get a probabilistic cleaning effect. So I think those are some things we'd like to get added to the cloud, certainly going forward. I think those are important. You have to have ways of tidying up. But in a sort of private cloud, sometimes your only option is to politely ask people to please clean up after themselves, and that can be a long drawn-out process. Can I add one thing? What kind of drives our data centers is that we have these large key customers, and they're the ones that are really kind of helping us out, you know, on the financial and budgetary side. And they in turn help our small customers, because basically they help us to set up the large data centers, and then the extra capacity we get to use for the small customers, to enable them to do their testing without actually having to have, you know, budget upfront for that if they just want to do small-scale pilots.
Yeah, and I think that's kind of analogous to the AWS free tier in a way. We want to sort of have people be able to come on board with a low barrier to entry, not have to pay anything, and then find a way to recoup some of our costs from our main customers. One of the interesting things for me is that we recently moved our support channel into Slack from IRC. It's probably become one of the most active Slack channels. And one of the things we consciously did there was to not separate the support channel from the sort of discussion channel. I find that very interesting. It's sort of generated a social support community where folks who are not on our team will proactively answer questions that other users are asking. You guys have been around longer than I have on the operations side. Have you seen any other positive changes coming out of reaching out to the community through Slack or IRC? I think definitely. So on the engineering side, we were using almost exclusively IRC until recently, when we made the transition to Slack. And IRC, you know, that's the traditional form of communication. You know, half of us — I don't actually have mine on here — but we have our IRC names on our badges and stuff like that. But our customers weren't using that communication medium. So we were using it internally, but we were kind of disconnected. And that's never good, to be disconnected from your customers, right? You want to be connected to them. You want to hear what their pain points are. Also, you know, Steve's favorite feature was that it has a mobile client. I wouldn't say it's my favorite feature, but Steve used it all the time. And actually, I use it all the time too. So I'll get, like, a 10 p.m. "Hey, where are we with this?" And I'm like, oh, OK, I've got the mobile client, I can reply. But yes, it's been great for our communication.
And also, you know, on the engineering side, we have a much more intimate relationship now with our customers, where they can really feel like they can just contact us and ask us a specific question. So I just want to pick up on one of the things you were discussing — that you are on the engineering side, and Sheila and Rich are both more on the operations side. We're sort of trying to go towards a kind of DevOps, CloudOps, NetOps, whatever-you-want-to-call-it type of model. Curious what thoughts people have about, you know, how we work together efficiently, and what things have been working well in terms of working between our engineering and operations teams. So, you know, I haven't been at Comcast that long, but what I've seen here is great compared to when I worked in other environments. We're based out of Washington, D.C. I worked as a government contractor for many years. And oftentimes, you know, your operations group is on another floor, or they're in another room, or they're somehow physically separated, maybe in another building. For us, it's literally, in our office, every other cube. A weave. Right. It's a weave. So when somebody is having a problem on the operations side, they literally can just lean over and talk to one of the engineers, and vice versa. So it really helps us. We don't have to go to formal meetings to understand each other's pain points. And as an engineer, I can just observe readily what the operations team is having issues with. Right, because one of my jobs is to make the operations team more efficient and to help them out with, you know, more sophisticated tools, and, you know, providing them the documentation to use those. So having that close-knit kind of DevOps environment, I think, has been very, very helpful. And we have a shared Slack channel, too, that we're in, because we're not always in the office. Right. That's true. Yeah.
And I would say something that I've seen that's working really well is our operations engineers, like Sheila and Rich — day to day they're solving problems, and they have a pretty good sense of where they're spending their time, and they're each coming up with their own sort of projects, tools that will help us be more efficient and more effective. Do either of you want to talk about either your pet project to improve operations, or some of the others that you know about? I know Michael is working on some monitoring stuff and so on. Either you have a pet project, or something that you'd really like to see developed that would make your life an order of magnitude better. Well, it's not my personal project, but one of our other team members is currently trying to build a tool that monitors how much quota is given out in terms of each region, each project, the users. And hopefully we would love to turn that into a tool used by the community. You have a pet project? I do a lot of stuff in the community when I have time. So, a lot of documentation. That's pretty much one of my pet projects. So I just want to talk about one more area, and then we can open the floor for questions. One of the things we talked about is we have a very diverse set of tenants and workloads, and it's not necessarily a fully cloud-native set of tenants yet. Some of them are quite sophisticated. Some of them are seeing it more as a migration option from existing infrastructure. So we have to deal with a lot of various problems coming from that infrastructure and so on. So, you know, what are some of the practices — especially Sheila and Rich, you're sort of on the front line — that you've found are very effective for dealing with customers that maybe are themselves coming to terms with the nature of our OpenStack cloud? Maybe it's not quite what they're used to. How do you sort of educate them and get them moving into a cloud-native world? Sure.
Well, a lot of people come from a development background where their applications are running on a server, and if something happens to that server, they're used to calling support to get that server back up. But we're trying to move them from that sort of "take care of it, treat it like your pet" approach to a model where you're treating your cloud instances more like cattle. If something happens to one, it's not worth your time to try to fix it — just shut it down, spin up a new one — and we try to make it as painless as possible for them to do that at the touch of a button. Yeah, it's interesting. I was a user of our OpenStack cloud before I took on this team, and I found that that was one of the really nice things: you didn't have to worry too much about, you know, if a VM was dead — well, you just spin up a new one. And interestingly, a lot of the questions on the Slack channel are basic access questions, and we spend a lot of time on that, and our users spend a lot of time on that. Whereas in many cases, the right answer is actually: if you can't log into your VM, leave it — move away, shut it down, or let it be shut down by our automated tools later — and just start a new one. But getting the customers to a position where they're ready to adopt that model of working is quite tricky. So we've got to educate them. Sorry. Yeah, I was going to add on to that. So yeah, we've just got to educate them all the time, and we provide various ways to educate them, from PowerPoint presentations to documents to Wiki pages to videos, because some people don't like to read and they just want to watch, you know, a five-minute video just kind of giving them an overview of best practices. And typically we have a good handle on who our customers are. There are a lot of them, but we do have repeat ones. So people, you know, they might ask us to keep saving an instance.
And finally, we'll just get on the phone with them and say, hey, we might need to push back on this because it's eating up, you know, five, ten minutes of our time each time. Maybe you should look at doing it a little bit differently, more cloudy. Yeah, and just my sort of final observation and comment before we open the floor to questions: as a user of the Elastic Cloud, it was really very evident to me that we had done a good thing for our community of users and product developers and so on, which probably numbers in the thousands internally. We made it so easy for them. And this was, I think, the whole point of building the cloud self-service. They can bring up new instances. They can manage their own security rules, their access to and from the internet, without ever filing a ticket. And so we really optimized for developer velocity and time to market. That's something that Comcast has been really focusing on. And I think it really pays off. We've seen those teams that have bought into the model of cloud native, and how OpenStack — or how our OpenStack infrastructure — works, really get a lot of benefit from being able to do a lot of things, with a lot of the barriers that used to exist removed from them. So it's been very successful. I think, as these guys said, we've seen that reflected in the phenomenal level of uptake — you know, 100% growth, at least, year on year. And I don't think that's going to slow down. If you build it, people will come. It's like Windows: it will consume whatever resources you give it. So definitely on the storage side, just keep on adding zeros. So we have about 13 minutes of time available. We'd love to hear your questions. We'd like to understand who else is running production clouds, what kind of issues you have, whether you see the same things we do, whether you have cloud-native users, legacy users, more traditional migrations.
Are you storage-focused, or network, or compute, big data? We have some folks that use our OpenStack platform to run Hadoop. They have a presentation tomorrow at 11:50. Very interesting. Please come to that. So, yeah, let's open up for some questions. Yeah, please take the microphone. Can we get the microphone on, please? The Q&A mic. Quick question. You mentioned once you reach a memory threshold, for instance, you try to then switch over to some other region and so on and so forth. So how do you allocate the VMs? I mean, do you have policies on, like, you know, if you pack them up in certain ways? Because obviously the average memory can be 70% and still there could be many bare-metal hosts which have plenty of memory left, if you just go by that. So I was just curious — I mean, the average is always different from, you know, what the reality is, right, sometimes. So just curious on how you allocate the resources. Well, we measure it by what is allocated overall, whether or not that full memory is being used. We don't currently have bare metal in production. It's all VMs. So each VM, when it's allocated the memory, that's what we're measuring, so that every VM has the possibility of using all of that memory on the compute nodes without causing all of the other VMs to crash. We're using the default scheduler, so they just randomly pick a compute node to go to. And we have enough capacity. I mean, you're never trying to max out your capacity, right? So maybe if you get up to 75% utilization overall, you know, on your compute nodes, that's what you're aiming for. And, you know, the scheduler is smart enough not to actually, you know, overwhelm any individual compute node. So it should start filling in the ones that are underutilized, because you have, like, maybe a really big flavor on one of them. So, I mean, 70% is pretty good utilization. Way back, 10% was good, so 75% is pretty good. But how much would you pay for doing the last mile?
Like from 75 to 90 or 95? I mean, quote-unquote "pay," meaning, you know, how important is it to do the next 20%, or is it not that important? I think — can I answer the question? Sure. I'm not just a pretty face. I think part of it is that some of the failure modes at high utilization are not pleasant. Certainly with IP addresses, we find that in many cases that's one of our limiting resources. If you run out of IP addresses completely in a region, then that's just really bad. Memory — we're pretty conservative about how we oversubscribe. We've been slowly turning that dial up from actually a little less than 100%. We're now up to about 150%. CPU — you can oversubscribe CPU pretty easily, and that's not a problem. And I think partly we're limited right now by scheduling. We have users that want to have pretty significant chunks of the resources in a region, and that's really where the scheduling problem starts. We sort of have an interesting workload where we have a number of big users and then a very long tail of very small users, and you can kind of spray the small users all over the place, but where you have to be very careful is where you put two big users together. So, for example, our big data platform and one of our other major tenants don't play nicely together. So storage scheduling is something we're very interested in, in sort of future product releases. And yeah, I think your question is a good one. Certainly, how do you manage the big customers, the elephants? You know, we do some coarse-grained stuff — put this guy in this data center and this other tenant in that data center — which obviously helps, but it's not the best way to do it. So our ephemeral disk on our compute nodes is solid state, and it's a fair amount, you know, it's into the terabytes on the compute nodes.
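For context, the memory oversubscription dial Steve describes maps to Nova's allocation-ratio settings. A minimal nova.conf sketch: `ram_allocation_ratio` at 1.5 matches the roughly 150% figure mentioned above; the CPU value here is purely illustrative, since the panel only says CPU oversubscribes "pretty easily."

```ini
; nova.conf — allocation ratios consulted by the scheduler
; (values below are a sketch matching the figures discussed above)
[DEFAULT]
; 1.5 = allow allocated VM memory up to 150% of physical RAM
ram_allocation_ratio = 1.5
; illustrative only; CPU tolerates much higher oversubscription
cpu_allocation_ratio = 4.0
```

Turning this dial up raises effective capacity per node, but as noted above, the failure modes when guests actually use their allocations are unpleasant, which is why the 70% fullness threshold is applied on top of it.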
So I have, you know, some customer workloads where having access to the solid state on the ephemeral disk is something they're very interested in, but we have to be careful in terms of, you know, creating these jumbo flavors for them. I think one of those customers might be here — I see Chris in the back there somewhere. Hi, Chris. Yeah, he's hiding right now, that's the guy. But yeah, so we just have to keep an eye out for that. He also mentioned storage, so real world for me is, you know, I'm kind of manning the Ceph systems, and in Ceph what we found is that if you get much over 75% utilization, the algorithm it uses doesn't balance evenly across all the disks. Just by definition it's a compromise between the speed of calculating where your data is and leveling it. So you wind up with imbalances between the disks. So around 75% is about all you want to use in terms of your capacity, because — I think once I go to about 78 or 79 percent, I get into a warning state in Ceph where I get something called near-full OSDs. Which is like: I'm okay if they're near full; if they're full, I'm going to have a bad day. So we aim for about 75% utilization on our storage. Yes, so can we pass the microphone back? Just shout loud, go ahead. Good question. Does anybody else want to take it? I think — when do you choose to get new hardware? That's a continuous process. Every day we are evaluating different hardware and different vendors, and again, you know, it's continuous, right? So you're meeting with your customers: what are their projected workloads? You're meeting with vendors, you know, because technology is moving at such a rapid pace. You know, Samsung's got a 16-terabyte SSD right now, right? And a year ago they didn't — we didn't have those capacities. Even, you know, spinning platters are going up in terabytes. So you're constantly evaluating that new hardware to see if it'll meet the capacity.
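The "near-full OSDs" warning James describes comes from Ceph's monitor thresholds, shown below at their stock defaults. Because of the imbalance he mentions, individual OSDs can cross the per-OSD nearfull mark while the cluster average is only around 78-79%, which is consistent with what he sees.

```ini
; ceph.conf — monitor thresholds behind the "near-full OSD" warning
; (stock defaults shown; raising them only buys time, it does not
; fix the imbalance between disks)
[mon]
mon osd nearfull ratio = 0.85   ; per-OSD warning: HEALTH_WARN, near full
mon osd full ratio = 0.95       ; per-OSD hard stop: writes are blocked
```

This is why a cluster-average target well below the warning threshold (the 75% the panel uses) is a sensible operating point: it leaves headroom for the uneven placement.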
And the goal, right, is for it just to match. You know, you don't want to be buying twice the amount of hardware for the next year. The key is getting it just in time so that everything works. And that's why you need that really good communication with your customers, to figure out, okay, what are their estimates? And, you know, provide yourself a little bit of a buffer for that. And in terms of justifying it, when we get new hardware and deploy a new OpenStack region on it — we just recently opened a region, I believe it was in June, and by September we had already reached 70% utilization of it. So it's very easy for us to justify getting the new hardware, because we have to keep getting the new hardware because we have so many customers. Yeah, I would just add to that. Am I on? I am. You have to be ahead of the curve, really. In a large company like Comcast, procurement takes a long time. Even putting aside the large-company thing, I mean, just getting vendors to ship you stuff — they have delivery time, lead time, and so on — so you can't be reactive when it comes to building out data centers. It gives your users a poor experience. Right now we run with two data centers that are open for new tenants, but once we get down to one, it's really kind of not ideal, and so we really want to try and get ahead of the curve and be proactive. And this is one of the nice things about the fact that we run at such a high level of utilization: in terms of justifying the investment in the data center, we see that reflected in the fact that people will come and use it. If you build it, they will come.
It's a different argument to go to the finance people and say, give me this much money to get out ahead of the curve, and I think that's something which is harder, especially in highly multi-tenant environments where you can't just point to one big customer and say, look, this guy saved a lot of money by moving from AWS to here, and that constitutes 90% of my cost. We're spread over so many tenants — our top 10 is less than 50% of our total resource usage — and so it's hard to build an accurate model for where you're saving costs. We're pretty confident we are, but pulling all the pieces together to justify that investment to get out ahead of the curve is really hard, and so this is an interesting non-technical aspect of production cloud at scale. Talking about scale, you know, scale-out — just kind of expanding on this answer a little bit. I think one of the most interesting challenges you have is you're going into a new data center and they say, okay, how many U of rack space do you need? How much power do you need? What's your BTU consumption? How much network do you need? So you give them these numbers and they say, okay, we have this for you, and then you roll out all your hardware to this new data center. This is one of the beauties of scale-out, and kind of one of the curses. Now a year goes by and that data center is at capacity, and, you know, OpenStack, Ceph, they're all designed for scale-out, but now you have a problem, right, because you're in a data center and it's full. You don't have any more network drops. That's one that hits me all the time: I'm out of network drops for, say, something like storage, or the rack space that is long since gone. So the way we do our scale-out is more in terms of the new deployment, right? Once we've kind of built out a data center, that's pretty much where it's going to be.
The scale-out is going to happen on the software side when we're talking about a new data center, and we can use the same software, but we can double everything. We can do double the storage nodes, double the compute nodes. And then the plan is, you know, you just decommission the old sites. That's the easiest way to do it. So we've got time for one question from this gentleman over here. Speak loudly, please. Yes. Right. So there are definitely, you know, a lot more customers asking us for object storage consumption, and we tend to see very high write workloads for that. Not to say that's the only thing — I have a little bit of tunnel vision right now, because the customers I'm immediately dealing with have a lot of high-write asks. So that's one difference. Versus when I was dealing more with block storage, it was really like: how many disks can I put in a chassis? How can I get my dollar per gig down? Versus now, the customers that are asking me for object storage, they're like, hey, you know, the latency is too high on returning this object; my application is getting behind because I can't write quickly enough. It's like when you have a file system, right? You don't sit there with a stopwatch timing every single file. But when you've got customers using object storage, yes, that's what they're after. So when you have customers asking for that, start thinking a little more about performance. That's at least what I'm doing. All right, we have to wrap up quickly. I'm going to give the last word to Sheila. I'm going to do a little quick recruiting thing. Oh yeah, I don't wear this shirt just as a fashion statement. Comcast is recruiting. We'd love to hire storage, networking, OpenStack engineers. Sheila, tell the people why it's awesome to work at Comcast on our Elastic Cloud team. Quickly.
It's awesome because you have a lot of flexibility to work on stuff that you really, really are passionate about. It's a great team — management all the way down to the ops engineers, the developers. It's just, it's awesome. Come see any of us. Yes, we will all be here through tomorrow at least. We'd love to talk about what we do. Any more questions, please find us. You can also find us in the virtual world. Thank you all very much. Thank you. Arigato gozaimasu.