It just shocked them now; they don't know what's going on. All right, welcome, welcome everyone to Deploying Lots of Telco Clouds. Today we hope to walk you through our journey of deploying a massive number of both large and small sites across the globe at AT&T, and the challenges we encountered along the way. So first, a brief introduction to AT&T: who we are and what we're doing with OpenStack. To set some context, AT&T is currently experiencing explosive growth in network traffic. On an average day, we're transiting about 114 petabytes of data across our network. And at the same time, the number of services that we want to deliver to our end users, and the speed at which we want to deliver them, is rapidly increasing. So really, the only way for us to address this challenge is to change the way that we build and operate our networks. At the core of this transition, we're replacing our traditional purpose-built hardware with software components and virtualizing our network functions. To accelerate this, we've adopted OpenStack at the core of our cloud platform. It's certainly not been without its challenges, which we'll get into in a moment.

But let me introduce the speakers. First, myself, Alan Meadows. I'm a cloud architect, helping AT&T build a massive number of cloud sites across the globe. Lee Revere, principal member of technical staff, doing, thank you, deploying lots of clouds as well, obviously, and a network engineer for more years than I can count. I'm Mike Wilson. I like to interrupt people. I've been designing, planning, and deploying OpenStack clouds for the last four years, and the opportunity to do it with AT&T over this last year was great. So I'm happy to be here with you guys.

All right. So first, we want to talk a little bit about the challenges telco clouds have. A lot of them are fairly unique, so we just want to name a couple of them. First, they're typically deployed across a large number of sites. Usually the purpose of that is customer proximity: we want those NFV services to be as close to the customers as possible. They also tend to introduce a lot of complex network requirements that other clouds don't have. And finally, the virtual network functions running on those telco clouds tend to introduce some requirements of their own, such as specialized silicon on the compute hosts. In addition, telco clouds also operate at large scale, so we're running into the same problems a lot of the other large-scale OpenStack deployments are running into, especially around scaling and deploying OpenStack.

So at AT&T, we had a very large number of clouds to deploy, and we were deploying these all into distinct physical locations. But this did come with several benefits for us. The first is that we could place our cloud network services really close to our end users. The second is that this distribution meant we could configure different zones based on the regulations and requirements of the countries and regions those clouds were deployed in. And the third, with our shared-nothing architecture philosophy, is that each of these clouds was able to be self-sufficient: they were resilient, and they had as few external dependencies as possible, which ultimately increased their uptime. So to build so many greenfield sites, under such an extremely compressed time frame, was a significant challenge for us.
And we really want the audience to keep in mind that a lot of people focus so much on deploying OpenStack and the challenges that go into that, they often forget all the prerequisites that go into actually getting to that step in the first place. These are things like cabling, inventory, networking, the actual site build-out of racking and stacking the hardware, and all the peripheral enterprise components that generally need to plug into OpenStack, such as, again, monitoring and inventory. On top of that, our requirements demanded a solution that uses the same code base to install both very large clouds and very small clouds, which means we needed a single solution that could deploy clouds with hundreds of compute hosts as well as sites with as few as two or three servers, which is typically the size of the equipment we have at the last mile in AT&T central offices.

So to solve for these challenges, we embarked on what we call today the AT&T Integrated Cloud, or AIC for short. This is an OpenStack-powered cloud. Essentially, AIC is how we're bringing the AT&T network into the cloud and how we're dealing with that explosive growth, and it's all based on OpenStack. The engine inside AIC that essentially drives our solutions to these problems is, of course, automation. Our automation goals are, and continue to be, fairly ambitious. I like to say we want to go from the data center loading dock to a working cloud. These build-outs occur in three phases for us: first design, then deployment, and then management. And we've really successfully applied automation to all three at AT&T. I really like the circular graphic here, because these phases are all highly interrelated: they feed data into each other. The data that we collect in the design phase, which we'll touch on in a minute, is fed into the later phases, and when we go back to revisit sites to augment or expand them, the cycle just continues. All of our AIC tools have essentially been built from the ground up to work together. So in this presentation, we're going to cover each of these three phases, design, deploy, and manage, and essentially how we're able to accomplish our goal of managing such a large number of sites and take them from the loading dock to a working cloud. At this point, I'll turn it over to Lee Revere. He's going to explain how our OpenStack site design process works.

Thanks, Alan. Hashtag The Real Deal Meadows. So, automating design. This I consider my wonder graphic: I wonder who made it, I wonder what they were thinking, I wonder how this can be used as anything other than an example of what you don't want to see happen. For many years now, the design process has been really manual. Usually we create spreadsheets, a Visio diagram, a number of things that, depending on the size of your organization, can be outdated before the implementation is ever complete, never mind six months or more later when you go back to expand a site. You usually have to call people and do a manual audit to find out what you still have on site, what's been used, what changes have happened. And again, these manual steps were fine for the past couple of decades, when three to six months was an acceptable deployment time. But with our clouds, we're trying to go at a much more rapid pace, and at a large scale in number of sites.
So we had to eliminate a lot of intergroup handoffs and manual processes that really slow down the deployment phase. Everything needs to be looked at from an automation-first perspective: if it's not being automated, it needs to be challenged and examined for how it can be automated. We had no single view of a site, and this is pretty common. You have a server group that maintains all of their server equipment and what they've deployed; you have a network group that has their own documentation. And we happen to have a lot of server groups and a lot of network groups, and we didn't always have the same inventory system or management applications. We had different groups that owned different types of sites, like central offices (COs) and enterprise data centers, different-sized data centers, and they would even use different naming standards, IP schemes, and cable layouts. So you really had to track who owned the site to figure out what they needed to deploy, based on what their standards were. This caused spreadsheets to grow to, in some cases, 23 tabs. You often had repetitive data, so not only did you have to enter it all manually, you also had more opportunities for human error, which only slowed the process down further. Now, if your goal was to give more opportunity for human error and to delay things, I'd advise you to include a copy of the wonder graphic, because it goes nicely with that option. My first reaction when I saw this, honestly, was: what the Fook? And since my boss is here, I'm not using bad language. I said Fook, and I'm referring to the street that I actually lived off of when I was in Hong Kong. I knew I'd find an opportunity to bring this up somewhere.

So I couldn't see any way to top the wonder graphic and its perfection of complexity. Instead, we went the other direction and tried to make things easier across the board. What we came up with is AIC Formation. Not a blank slide, but I'm working on being a triple-threat presenter, meaning bulleted slides, graphics, and a transition, so I think I've got them all covered. This is the home screen for AIC Formation, our cornerstone application for automating the process. And the result: what was taking 10 to 15 days, and I'm talking about one person working two to three weeks on one site, we can now do with AIC Formation in under 15 minutes. I think that's really huge. Imagine taking two to three weeks of design time out of your work schedule and dropping it down to 15 minutes, or two to three weeks' worth of meetings down to 15 minutes. That's something I can only dream about.

So the automation process provides more detail. You don't have the spreadsheets to deal with, obviously. You don't have the handoffs; we've eliminated most of those. We've reduced the possibility for human error to a very small amount. And the automation doesn't just produce documents or outputs: it provides our deployment automation directly with the files that are used to do the deployment, and it feeds our management applications. So if you've ever maintained, let's say, Nagios, where every time you created a site you would go in, add that site, and start collecting data: if you automate that process during the design, you always have something that's current, and you don't need a person making sure it's updated every time you add a site. The further we delved into what we could improve, the more this thing just grew exponentially. So we have a lot of groups now coming in wanting to get their stuff added to it.
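To make that monitoring feed concrete, here's a minimal sketch of the idea: rendering Nagios host definitions straight from a site design export, so monitoring is current the moment a site is designed. The JSON shape, field names, and file paths here are illustrative assumptions, not Formation's actual schema.

```python
# Minimal sketch of feeding a site design into monitoring config, in the
# spirit of the Formation -> Nagios feed described above. The JSON shape,
# hostgroup naming, and output path are assumptions, not AT&T's schema.
import json

HOST_TEMPLATE = """define host {{
    use                 generic-host
    host_name           {name}
    address             {mgmt_ip}
    hostgroups          {site}-{zone}
}}
"""

def render_nagios_hosts(site_design_path, out_path):
    """Render one Nagios host definition per node in a site design file."""
    with open(site_design_path) as f:
        design = json.load(f)
    blocks = [
        HOST_TEMPLATE.format(
            name=node["name"],
            mgmt_ip=node["mgmt_ip"],
            site=design["site"],
            zone=node["zone"],
        )
        for node in design["nodes"]
    ]
    with open(out_path, "w") as f:
        f.write("\n".join(blocks))

if __name__ == "__main__":
    # Example design: {"site": "dal01", "nodes": [{"name": "compute-001",
    # "mgmt_ip": "10.0.0.11", "zone": "z1"}, ...]}
    render_nagios_hosts("site_design.json", "hosts.cfg")
```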
We've got more features on the feature request list than I can predict how long our scrum teams will need to fulfill them, and we're replacing a number of applications in AT&T with just this one. A lot of people think of this only as a design tool, and you'll hear a pause on most conference calls waiting for me to express my displeasure at just calling it a design tool, because I look at this as a self-supporting ecosystem of applications. You create the design, you feed the automation, you maintain the site information. Nobody has to go out and find out: do I still have ports left on this switch, or do I need to add cards for more capacity when I want to add X number of computes? We feed directly into our deployment automation, which, as I mentioned, is OpsSimple and Fuel. We also create things like the DNS entries for our management domain name system and the enterprise DNS, we create firewall rules, and we make all of the data available via REST APIs. So future applications that need the data just have to be able to reach out to our APIs. If you look at many purpose-built legacy tools, you'll find single teams with a tool collecting information that's not available to anybody else if you don't know about it or don't have access to it. So sharing all of the data was one of the main things we wanted to make sure we did.

Creation of a new site starts with just two tabs. The first tab is site-related information, and the second tab is zone-related information; we have individual sites that contain multiple zones. Another thing we did was automate the entry in this regard. If you have multiple zones, a designer might otherwise have to go look at another reference to figure out what zones have already been deployed. This keeps track of that, so they never have to go somewhere else to find the information they need, and it prevents them from creating a duplicate. Almost 80% of what we can build comes from just these two tabs. And what do you get from that? Well, here you go: how about a rack elevation? This gives a lot of information that I'm not sure how well you can read here. Where's my laser pointer? Can you see that? That's not working. In the upper corners of the racks, and I know nobody over there is going to see this, you can see the rack name, and you can see the power consumption of all the devices in the rack. We use this for more than just pretty numbers on the screen. If a designer were to add more nodes than could be supported by the power available per rack at the site, it would show up in red, and it would further show the number of nodes that would need to be eliminated to meet the actual power constraint of the site. We have some sites, like central offices, that might only support four to six kilowatts per rack, and that's not a lot. In that case, the designer wouldn't be able to submit the design until they fixed those issues. The next slide is, at least in my experience, a pretty standard cable map, the kind usually done in a spreadsheet: what type of cable, from where, to where. And again, all of this is generated through the automation from the first two tabs. If you've ever done one of these for a site with 500 or thousands of cables, this alone would be a reason to want this application; human error is almost a given when you're filling in that many spreadsheet cells with repetitive copying and pasting.
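Going back to the rack elevation for a moment, the power check Lee describes is easy to picture in code. This is a minimal sketch under assumed data shapes (per-device wattages, a flat per-rack budget), not AT&T's actual validation logic: it flags over-budget racks and reports how many nodes would have to come out, which is the feedback the designer sees in red.

```python
# Minimal sketch of the per-rack power validation described above: flag racks
# over budget and report how many nodes would have to come out. Device
# wattages and the kW budget are illustrative assumptions.

def validate_rack_power(racks, budget_watts):
    """Return {rack_name: nodes_to_remove} for racks over the power budget."""
    violations = {}
    for rack, devices in racks.items():
        total = sum(d["watts"] for d in devices)
        if total <= budget_watts:
            continue
        # Remove the hungriest nodes first until the rack fits the budget.
        over = total - budget_watts
        removed = 0
        for watts in sorted((d["watts"] for d in devices), reverse=True):
            if over <= 0:
                break
            over -= watts
            removed += 1
        violations[rack] = removed
    return violations

if __name__ == "__main__":
    racks = {
        "rack-01": [{"name": f"compute-{i:03}", "watts": 750} for i in range(10)],
        "rack-02": [{"name": f"compute-{i:03}", "watts": 500} for i in range(8)],
    }
    # A CO-class site might only support ~6 kW per rack, as mentioned above.
    print(validate_rack_power(racks, budget_watts=6000))
    # -> {'rack-01': 2}   (7500 W is 1500 W over; pulling 2 x 750 W fits)
```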
This is a zone summary. What you're able to track here is the first time the site was deployed, a summary of what was deployed, starting at the top green row, and then the next three expansions that were done in that site. So you can track, per VPMO as we call it, or purchase order, what was done during each of these deployments. We also implemented role-based access control, so your access to this information varies by your role, whether you can just view information or also edit it. That's important to ensure not everybody is in there making changes or has access to, say, pricing information, which we also collect here. And it keeps people from deviating from the gold standard that we've set: if somebody tries to add something that's not a standard for our deployments, it will track who made the change, and they will have to provide justification for making the change, so we can track that. Another way to think of this is that we feed the machine-validated output from AIC Formation directly into deployment automation that can take bare metal to a production-ready cloud with minimal manual entry on the user's part. To continue our riveting trilogy of automation, I'll turn it over to my esteemed colleague, Mike Wilson, who will explain where the design leaves off and the build picks up.

Thanks, Lee. Thanks, sir. I just want to ask you one question before you exit the stage. Sure. How long did it take before? Two to three weeks for one person to do one site. And now? Under 15 minutes. Okay, thank you. Thank you. Let's see, how do I use this clicker here? So I'm here to talk about deployment. I hope you understood the radical transformation that happened at AT&T, going from a three-week process to a 15-minute process. That's really important, but I want to talk about the deployment and the technology behind it. So we have lots of good data in AIC Formation: we can track all the sites, and we have all the physical characteristics, the hardware characteristics. But now our challenge is to go out and provision all these clouds, right? And it has to work. And by the way, some of these clouds are very small, so it's not like I can take up a whole server that just sits there and PXE-boots things. I have a limited hardware footprint that I can live in. So this thing needs to be a simple form factor, and it needs to be disposable, essentially. So what we did is we built a little bootstrap image. It's based on the Debian minimal installer, it's got a few apt package repos, and it does a couple of the things you would expect. The cool part about it is that it takes a feed from Formation and uses that to start the site deployment. When we put the bootstrap in, we'll typically mount this small image over the network and boot things up, and what we end up with is a set of capabilities that let us bootstrap the rest of the site. We call it the undercloud. I think this is an overloaded term, but whatever, I think people understand. So at this point the bootstrap is done, and the undercloud contains these five elements (laser pointer, it doesn't work on screen): Pixie Dust, which is some proprietary magic at AT&T that I'll explain later; Metal as a Service; Fuel; an access host; and in some sites, some VMware vCenter components. How many people here have ever dealt with lots of authoritative DHCP servers on a single network? How'd that work for you? Okay. So you might be confused like this guy.
When you see that I've mentioned we have Fuel, MaaS, and VMware all deployed in one cloud: we really like these tools, right? We've worked with Fuel in the past. We appreciate it for its OpenStack know-how and its simplicity; it just kind of works. It deploys OpenStack, and it has a nice plug-in framework. Metal as a Service, same thing: a lot of our supporting systems, Nagios systems, other Linux systems, we're used to deploying them with Metal as a Service, and we know how it works. And of course, if we want to be on the supported path for the VMware stuff, we need to be using their tools. So this was a conscious choice to enable all three. We wanted all three, and we wanted them to be able to coexist peacefully.

So Pixie Dust is a pretty simple solution. We basically want to intercept all of these DHCP calls and get them to the proper place. Again, our step zero is some data provided by AIC Formation. That goes in, and we populate, as you can see in this database here, two tables: a boot target table and a rules table. Boot target is just what it sounds like: where do I go? Rules are more complex. Here's where you can specify conditionals and more complex business logic as to what kind of hardware goes where, in what circumstances, et cetera. So we take the information in AIC Formation and populate that, and then we're ready to take new servers. We have the Pixie Dust service running. A new server comes up on the network and does its normal PXE boot thing, and Pixie Dust is the simple daemon that catches that PXE request. We do that with a DHCP helper, by the way, or a DHCP relay, so you will need to configure that if you were to do something similar. This comes in to iPXE; many of you might be familiar with it, it's an open-source boot loader. It gathers a limited set of hardware information and sends that off to the Dust API, and this is where the Dust API takes that hardware information and evaluates the rules. It will return a boot target, and then iPXE is going to chainload that boot target. And I haven't shown failure here, because that never happens. But on success, we then send updates to post-processors, to Formation, to whatever needs to know about it.

We also mentioned Fuel. The role that Fuel plays in this: we use Fuel to automate all of our OpenStack installation, and we also use its plugin framework and its orchestration abilities to deploy most of our non-OpenStack components. But Fuel at the time, I mean, we were using 6.1 to deploy all these sites, and now we're at 9.0. There's a big difference in capability between 6.1 and 9.0, and actually a lot of that comes from this AT&T experience. For example, data center racks in L3 islands: this is how we do networking at AT&T, at least in these locations. Fuel has kind of a macro role concept; it has a very monolithic controller. AIC doesn't have that: it has much more granular roles, much more granular deployment types. And Fuel is very opinionated about what your configuration options should be. That didn't work for us; we wanted those configuration options to be mutable, not fixed. So it was fun to work with the upstream Fuel team, proposing some of these changes and getting the code upstream. If you go look today, you'll see the stuff that AT&T contributed, and we're looking forward to working with Fuel going forward.
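To illustrate the Pixie Dust flow described above, here's a minimal sketch of a boot-target service: iPXE posts a few hardware facts, the service evaluates a rules table, and it hands back an iPXE script to chainload. The routes, rule fields, and targets are illustrative assumptions, not the actual Dust API.

```python
# Minimal sketch of a Pixie-Dust-style boot-target service, assuming a JSON
# POST from an iPXE script carrying basic hardware facts. All names (routes,
# fields, targets) are illustrative, not AT&T's actual API.
from flask import Flask, request, Response

app = Flask(__name__)

# "rules" table: first matching conditional wins; "boot_targets" maps a
# target name to the iPXE script that chainloads the right installer.
RULES = [
    {"match": {"vendor": "Dell"},         "target": "maas-enlist"},
    {"match": {"mac_prefix": "52:54:00"}, "target": "fuel-bootstrap"},
]
DEFAULT_TARGET = "local-disk"

BOOT_TARGETS = {
    "maas-enlist":    "#!ipxe\nchain http://maas.example/ipxe/enlist\n",
    "fuel-bootstrap": "#!ipxe\nchain http://fuel.example/bootstrap.ipxe\n",
    "local-disk":     "#!ipxe\nexit\n",  # fall through to local boot
}

def evaluate(facts):
    """Return the first boot target whose rule matches the hardware facts."""
    for rule in RULES:
        m = rule["match"]
        if "vendor" in m and facts.get("vendor") != m["vendor"]:
            continue
        if "mac_prefix" in m and not facts.get("mac", "").startswith(m["mac_prefix"]):
            continue
        return rule["target"]
    return DEFAULT_TARGET

@app.route("/boot", methods=["POST"])
def boot():
    facts = request.get_json(force=True)   # e.g. {"mac": ..., "vendor": ...}
    target = evaluate(facts)
    # iPXE chainloads whatever script we hand back.
    return Response(BOOT_TARGETS[target], mimetype="text/plain")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```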
So of course, you might be thinking there's something missing. There are a lot of foundational blocks, a lot of capabilities that I've talked about, but we need an orchestrator. This is where OpsSimple comes in. OpsSimple was born, I think, when an executive probably said, "I just want ops to be simple," and so we named it OpsSimple. Really good name. It's an Ansible-based orchestrator. It pulls configuration metadata from a central repository, a Git repository, and again, this is fed with the data from Formation and from the post-processing steps. It calls Fuel APIs; it calls MaaS and libvirt APIs. It will actually go in and execute commands for pre- and post-deployment tasks. It has hooks into our OSS and BSS systems. And like I said, we keep track of all this as code: there aren't humans necessarily involved in this. There's code, there's validation, there's CI around it. It's a good thing. And the logical configs, I already said this, are sent back to Formation. So yeah, that's OpsSimple. I want to turn it over to Alan, and we'll talk about what happens after this.
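As a rough illustration of that kind of orchestration, here's a minimal sketch of an OpsSimple-style driver: configuration metadata comes from a Git clone, gets a validation gate, and is handed to an Ansible playbook. The repo URL, file layout, and playbook name are hypothetical, and the real tool also drives Fuel, MaaS, and libvirt APIs directly.

```python
# Minimal sketch of an OpsSimple-style driver: pull site metadata from a Git
# repo, then run an Ansible playbook against it. Repo URL, file layout, and
# playbook names are illustrative assumptions, not AT&T's actual tooling.
import json
import subprocess
import tempfile

METADATA_REPO = "https://git.example/aic/site-metadata.git"  # hypothetical

def deploy_site(site):
    workdir = tempfile.mkdtemp(prefix="opssimple-")
    # Configuration lives in version control; deployments start from a clone.
    subprocess.run(["git", "clone", "--depth", "1", METADATA_REPO, workdir],
                   check=True)
    with open(f"{workdir}/{site}/zone.json") as f:
        meta = json.load(f)
    # Pre-deploy gate: refuse to run on metadata that hasn't passed CI.
    if not meta.get("validated"):
        raise SystemExit(f"{site}: metadata has not passed CI validation")
    # The orchestration itself is Ansible; extra vars carry the site feed.
    subprocess.run(
        ["ansible-playbook", "deploy-zone.yml",
         "-i", f"{workdir}/{site}/inventory",
         "-e", json.dumps(meta)],
        check=True,
    )

if __name__ == "__main__":
    deploy_site("dal01-z1")  # hypothetical site/zone name
```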
Thanks, Mike. So finally, we come to managing all these OpenStack clouds once they're deployed. Lifecycle management becomes infinitely more complex the more independent zones you have to manage, and at AT&T we're managing a really large number of zones. We mentioned before that our goal was to reduce the number of external dependencies per site with our shared-nothing architecture. That left us with the challenge of how to manage all of those loosely coupled, independent sites. We had several goals we wanted to solve for with this challenge. The first was to provide a consistent experience to our tenants: no matter what region they were interacting with, we wanted the same flavors there, the same images there, and other OpenStack resources, like Murano catalogs, to be consistent across all of our sites. The challenge is that these things are not static; we can't just load them once and call it a day. They're constantly changing. We also needed account management to be consistent and scalable, so solutions like pushing accounts out to the edge, and giving those edges authority over those accounts so we could remove centralized back ends like a gigantic corporate LDAP server that could become unreachable or fail, really interested us. And finally, we needed to be able to perform rolling upgrades to our sites and ensure that those sites had no interdependencies whatsoever across their OpenStack services. And given all of these independent, loosely coupled clouds, we needed a way for our tenants to actually be able to find them. So, enter our operational challenge.

Our solution was something we developed called the OpenStack Resource Manager, or ORM for short. You might have seen it in the keynote. The ORM is a collection of API services we developed that help us control our distributed clouds. We get the best of both worlds with the ORM: we have these standalone sites, very reference-architecture oriented, where the OpenStack components in one site are really only aware of themselves; but at the same time, we still have the ability to manage all of these loosely coupled sites. And really, the ORM has two fundamental core pieces. First, it's a resource creation gateway: all resource creation requests, such as tenant onboarding, loading new images into sites, flavor creation, Murano catalog updates, all of these sorts of things flow through the ORM. And second, it's a collection of micro API services that allow our tenants to discover our sites and the capabilities in those sites.

So let's briefly walk through these two things. As a creation gateway, the ORM really ensures that all of our sites and zones are consistent. This is important because our developers are depending on it. Using the ORM, either tools or operators can leverage a single set of APIs to inject things such as images, flavors, quotas, and so on across all of our clouds, or to a specific subcategory of our clouds, like all of our small sites or all of our large sites. Another example of where the ORM is leveraged is account creation, because for us, account creation is a complicated process. It's a lot more than a username and a password: there's a really complex workflow that new accounts need to flow through at AT&T, and it includes both OpenStack setup and touching things in the outside ecosystem. So to ensure that account creation goes through a central gateway and all of our business rules are applied to a new account, the ORM offers an API service that instantiates accounts across all of our clouds. This does two really important things for us. The first is that it establishes those tenants in the zones they should actually be able to deploy resources into, and it sets up the ancillary things those tenants require, like private images and private flavors, the things that are important for the workloads we know they're going to need to run. The second is that it also instantiates things like quotas in each of the zones those tenants are in. And finally, once that's done, those zones control their own authentication and authorization, and we no longer have that centralized authentication system dependency that might fail. I should also point out that while the ORM is a brand-new REST interface that we developed, the way the ORM processes these back-end requests is by pushing all of them into the OpenStack clouds using the native OpenStack APIs. Most notably, we leverage Heat. It also subjects itself quite nicely to being stored in version control and being independently validated. And we ensure consistency in each of the sites by running an ORM agent in each of the zones, which essentially ensures that the zones can reach the consistent state that we want without a central ORM process continually polling hundreds of sites.
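As a sketch of that gateway idea, one request fanning out to every region through native OpenStack APIs, here's what a flavor push might look like with openstacksdk. The region names are hypothetical, and the real ORM layers workflow, validation, and per-zone agents on top of anything like this.

```python
# Minimal sketch of the ORM's "resource creation gateway" idea: one request
# fans out to every region through native OpenStack APIs. Uses openstacksdk;
# the cloud names (clouds.yaml entries) are illustrative assumptions.
import openstack

REGIONS = ["dal01", "chi01", "lon01"]  # hypothetical region/cloud names

def create_flavor_everywhere(name, vcpus, ram_mb, disk_gb, regions=REGIONS):
    """Create the same flavor in every region so tenants see one catalog."""
    results = {}
    for region in regions:
        conn = openstack.connect(cloud=region)  # reads clouds.yaml entries
        flavor = conn.compute.create_flavor(
            name=name, vcpus=vcpus, ram=ram_mb, disk=disk_gb)
        results[region] = flavor.id
    return results

if __name__ == "__main__":
    # A consistent flavor across all zones, large sites and small alike.
    print(create_flavor_everywhere("nfv.medium", vcpus=4, ram_mb=8192,
                                   disk_gb=40))
```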
Finally, the last way the ORM helps us manage our cloud collection is by offering several API services to tenants themselves. Tenants can use the ORM to find the cloud regions they have access to, and also the capabilities in those regions. Capabilities can be things like which OpenStack services are running in a region, but also things like how much egress bandwidth a region has, whether functionality like DPDK or SR-IOV is enabled at a site, and other platform questions like that. It also provides our tenants a higher-level querying service that's not really built into OpenStack today: things like "give me a zone that I have access to that's in Texas," or "give me a cloud region that's less than 10 milliseconds away from my calling IP." And really, the ultimate place we want to take the ORM is to allow our NFV tenants to describe their workloads and all the requirements they have, and let the ORM essentially cloud-schedule them according to their needs and the business needs, taking their request all the way to provisioning with a single call.

So finally, we come to our lessons learned and our future plans, and I will let Mike Wilson here start with the first lesson learned. Sure. First lesson learned: work upstream when developing on these open-source projects, like Fuel or Metal as a Service. The second lesson we learned: you cannot invest too much in CI/CD. A lot of people know this; it's the cornerstone of our progress, our ability to automate releases, our ability to validate that what we're checking in is not going to break our platform, and all the testing we had to develop in order to arrive at valid releases we could push across hundreds of sites. Then: overlapping functionalities are okay, but be prescriptive, or your teams will use them in ways you never considered. There are always more opportunities to automate. My rule of thumb is, if you've done it more than once, you should automate it. And open your mind; this includes things like, have I racked a data center more than once? Automate it. All right, and so the final lesson: infrastructure elements are not islands. One of the things OpenStack has really helped us with at AT&T is breaking down some of our traditional enterprise silos. Our networking teams, our DevOps teams, our developers, our sysadmins are essentially working closer together than they ever have before. We still have a ways to go on that front, but it's really helped us transition, and to work with OpenStack, you have to remove those barriers. So thank you everyone for listening to our presentation. Thank you all.

So I believe we're the last session of the day; there's nobody else coming in here. So if you have questions, now's the opportunity. Can you come to the mic, please, so it's on the recording?

How are you handling technical debt, either for old sites, or how do you keep your current sites up to date in terms of updates and frequency and such? We accrue it. Right? You want to take it? Go ahead. Sure. Yeah. I mean, there is a virtuous cycle of accruing technical debt, and then there is the non-virtuous version where you never get rid of it, right? So, for example, with the Fuel contributions: we didn't want to keep that downstream, we didn't want to keep that close to our chest. We wanted that upstream and maintainable. So our process was to have that capability when we needed it, develop that capability early, but commit it upstream, and now everyone can use it. We enjoy it, we enjoy the maintenance of it, we enjoy the continued testing of it. So yeah, accrue the debt because it's valuable for the business function, then get rid of it. That's the virtuous technical debt cycle.
A second answer to your question: early on, we did learn the pitfalls of having some AT&T sites fall behind in their platform version. We experienced such great pain with that that it's no longer something we do. We have one single unified platform: all sites come up with the same version, and all sites get upgraded. If you want to talk about OpenStack upgrades, that's a whole other topic. Yeah. Any other questions?

Thank you. Very good presentation. My question is, from your presentation I can see you have deployed your application, and I'm not concerned about the upgrade of the application part, but for the infrastructure part, like OpenStack: have you done upgrades of it, like from Juno to Liberty or Mitaka, something like that? Sorry, is your question referring to infrastructure elements, or are you talking about OpenStack itself? OpenStack itself, you understand. Yeah, yeah, done that lots. You've done a lot. So, can this kind of upgrade be totally seamless, without service impact? Again, that's a whole other presentation, but really you have to break it into two sorts of upgrades. There's the kind where you have a whole paradigm shift: perhaps you're changing from out-of-the-box Neutron and moving to a whole new network controller or something like that, and that's happening as part of your upgrade. Those tend to be impacting, and it's difficult to do something about them. Then there are plain OpenStack version upgrades, where your paradigm hasn't really shifted and you're leveraging the same back ends you were in the previous release. We have a lot of different types of upgrades, and we endeavor to make those completely non-impacting, if that answers your question. Yeah, great. And can this kind of upgrade be done automatically? It's part of our release development; we're automating it. Yeah. Great, thank you. Any other questions? All right, you've been a great audience. Thanks.