Hello, and welcome to my session on building, managing, and operating a decentralized cloud for telco apps: lessons learned. My name is Gene Bagnell. I'm an associate fellow in Verizon's technology, architecture, planning, and strategy group, where I'm a cloud architect and a cloud-native evangelist for telco apps. I've been in the telecom industry for 35 years now and with Verizon for the past 16. Prior to joining Verizon, I worked as a VAR and as a consultant for telco and wireless operators. I'm also a recent addition to the DMTF Board of Directors. In this presentation, I'm going to talk about some of the lessons we've learned while rolling out cloud-native platforms to support the virtualization of key network components over the last few years. Some of these lessons may be obvious to you, some may not be, but my hope is that everyone will get something out of this presentation.

The Verizon Cloud Platform, or VCP, is a telco cloud supporting network function virtualization for all of Verizon's business units. It currently hosts components of our 4G EPC, 5G Core, IMS, SDM, and VoLTE platforms, as well as managed services for our business group and components of our Fios platform. It's been designed to deliver the high throughput and low latency that each of these applications demands. We do not host public workloads on VCP and have a separate cloud platform for traditional IT workloads. VCP began as an OpenStack platform in 2016, and in 2019 we introduced container virtualization to the platform. In 2020, we extended container virtualization to the edge of the network to support our vRAN initiative.

Before we get into the lessons learned, I think it's important to have an understanding of Verizon's scale, VCP deployment models, and the locations where we deploy VCP in our network. VCP is a global platform. There's an international component that supports our global fiber deployments and provides in-country access to the managed services our business group offers. Domestically, VCP provides the infrastructure that allows Verizon to deliver services to 200 million points of presence. This includes our 5G, 4G, and VoLTE wireless offerings, plus domestic managed services for commercial accounts and Fios. To do this, VCP is deployed as a distributed platform across the country, with each VCP instance acting as a standalone cloud. This allows us to put VCP and the services it offers closer to the customer base, reducing latency and improving the customer experience. To implement and manage VCP and the services it hosts, we depend heavily on orchestration and automation tools. We have also adopted stringent testing protocols and a DevOps mentality towards running the platform.

However, Verizon's scale is one of the biggest challenges we face when supporting VCP. Many commercial and open source software systems cannot deal with the 120-millisecond coast-to-coast latency, nor can they scale to manage or deploy tens of thousands or hundreds of thousands of instances. So most of our issues come from scale, and we'll talk more about those in a moment, but before we go there, let's briefly talk about how the network is laid out. Our network is like the layers in an onion. At the center are the data centers, which we call network equipment centers, or NECs. Our NECs host common systems for the wireless, Fios, and managed services platforms. Going out one layer are the service access points, or SAPs. SAPs host EPC elements, including the PGW and MME, and 5G core elements, such as the UPF and SMF.
VoLTE functions, including the SBC and the MRF, are also deployed here. The next layer is the access sites. Traditionally, these are used for cell site to SAP backhaul aggregation. With the introduction of 5G and URLLC services, we are moving more and more elements to the access sites to lower latency. The last layer of the onion is the cell sites. The cell sites come in two flavors, CRAN and DRAN. CRAN sites are commonly found in metropolitan areas and control multiple cell towers, while DRAN sites are found in urban and rural areas and are typically deployed at the base of a cell tower. In the NECs, SAPs, and access sites, we have deployed OpenStack and Kubernetes for VNF and CNF workloads. At the CRAN and DRAN sites, we're actively deploying Kubernetes to support vRAN.

This is a DRAN shelter. This is possibly the worst place in the world to deploy server hardware and a Kubernetes cluster. DRAN shelters are classified as Class 2 telco environments because they use ambient air for cooling. The hardware in the cabinets has to operate at higher temperatures and humidities, and it also runs on 48 volts DC, which means it needs to be NEBS compliant. The cabinet in the center of the shelter houses the baseband unit, or BBU. The role of the BBU is to convert from radio protocols to TCP/IP. The BBU is a physical network function, or PNF, comprised of two components: the distributed unit, or DU, and the centralized unit, or CU. Verizon is virtualizing its RAN infrastructure, which means putting commodity hardware in the BBU cabinets to implement vRAN.

So what is vRAN? Virtualized radio access network, or vRAN, breaks the components of the baseband unit into microservices. The centralized unit, or CU, becomes the vCU, and the distributed unit, or DU, becomes the vDU. Virtualization of these workloads has a lot of benefits. Commodity server hardware is less expensive than the proprietary BBU platforms. It also takes up less room in the cabinet and consumes less power. However, I believe vRAN's biggest advantage is the flexibility it offers. With vRAN, we can replace the software in a couple of hours, and hardware upgrades can be installed in parallel, tested, and then traffic can be moved from the old platform to the new. Another advantage is the ability to test new versions or features of the vDU and vCU using canary deployments. This results in faster delivery of new services and features to the customer.

So now that you know a little bit about VCP and the workloads it hosts, let's get into the lessons learned. I've broken these lessons down into the following categories: hardware, deployment, platform, automation, and monitoring.

In 2016, we built our first cloud platforms on enterprise-class hardware that met ASHRAE A3 temperature and humidity requirements. In 2018, when we started to work on vRAN, we realized that we needed to adopt ASHRAE A4 or better based on the environment that vRAN would operate in. At the time, few OEMs supported A4, which resulted in us partnering with a couple of OEMs to build new classes of server hardware that could support vRAN. So what lessons did we learn? Validate that your hardware actually operates at the temperatures you need. We've seen A4-rated hardware shut down at 40 degrees C, and A3-rated hardware that was still running at 50 degrees C. Verify that temperature alarms have the correct severity levels defined. Out of the box, critical and shutdown temperature alarms and notifications are often classified as informational, which means they often are not displayed by the management platform and therefore not acted upon. Validate that the critical and shutdown temperatures are correctly set in the BMC. Many BMCs support ranges for these, so make sure you have them set where you want them to be. And finally, find a way to test the temperature alarms and shutdown temperatures for your servers with all the components installed. More than once, we've had a NIC or an accelerator card shut down at a temperature well below the top end of the ASHRAE rating of the server.
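To make that last point a bit more concrete, here is a minimal sketch of how you might read the temperature sensors and their critical/shutdown thresholds from a BMC over Redfish. It's purely illustrative: the BMC address and credentials are placeholders, and the exact Chassis and Thermal resource layout varies by vendor and Redfish version.

```python
#!/usr/bin/env python3
"""Sketch: list temperature readings and thresholds from a Redfish BMC."""
import requests

BMC = "https://192.0.2.10"      # hypothetical BMC address
AUTH = ("admin", "password")    # placeholder credentials

def get(path):
    r = requests.get(BMC + path, auth=AUTH, verify=False, timeout=30)
    r.raise_for_status()
    return r.json()

# Walk every chassis and dump its temperature sensors with their thresholds.
for chassis in get("/redfish/v1/Chassis")["Members"]:
    thermal = get(chassis["@odata.id"] + "/Thermal")
    for sensor in thermal.get("Temperatures", []):
        print("{:30s} reading={} critical={} fatal={}".format(
            sensor.get("Name", "unknown"),
            sensor.get("ReadingCelsius"),
            sensor.get("UpperThresholdCritical"),
            sensor.get("UpperThresholdFatal")))
```

Something this simple, run against every server model with all of its cards installed, is usually enough to spot a threshold that doesn't match what the data sheet claims.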
One of our goals with VCP, especially at cell sites, is to use fiber for all network connections. This means a combination of 1 gig, 10 gig, and 25 gig connections to the servers. For 10 gig and 25 gig, this is generally not an issue. However, the same can't always be said for 1 gig connections. Some combinations of NIC, SFP, and switch will not auto-negotiate down to 1 gig. For us, this problem is exacerbated by the fact that we have multiple switch vendors and have also moved to using third-party vendors for our SFPs. When this happens, you will need to make changes to your switch configuration, and in a few rare cases, you may need a firmware upgrade for the NIC. The lesson here is to make sure you test all possible combinations of NIC, switch, and SFP to see which ones work out of the box and which ones may need a config change or perhaps even a firmware update.

Since we're talking about firmware, part of our automation for deployment and ongoing operations assumes a consistent way of handling firmware upgrades. Using Redfish for BMC and BIOS firmware upgrades works consistently. But what about upgrading the firmware on a NIC, an accelerator, or even locally attached storage? In 2021, outside of the BMC and BIOS, PLDM is still not widely implemented for firmware updates to peripheral devices. In fact, many OEMs are still developing and building peripherals that do not support PLDM firmware updates. So what are the lessons learned here? Redfish plus a BMC and peripherals that support PLDM firmware updates will give you a consistent mechanism for deploying firmware. So push your vendors to implement PLDM firmware update for all the components in your servers.
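For what it's worth, this is roughly what that consistent mechanism looks like from the automation side: a minimal sketch that pushes an image through the Redfish UpdateService's SimpleUpdate action. The BMC address, credentials, and image URL are placeholders, and the supported transfer protocols and task handling differ between vendors.

```python
#!/usr/bin/env python3
"""Sketch: trigger a firmware update through the Redfish UpdateService."""
import requests

BMC = "https://192.0.2.10"                      # hypothetical BMC address
AUTH = ("admin", "password")                    # placeholder credentials
IMAGE = "http://repo.example.net/nic_fw.bin"    # hypothetical firmware image URI

# Look up the SimpleUpdate action target advertised by this BMC.
svc = requests.get(BMC + "/redfish/v1/UpdateService", auth=AUTH, verify=False).json()
action = svc["Actions"]["#UpdateService.SimpleUpdate"]["target"]

# Ask the BMC to pull the image and stage/apply it.
resp = requests.post(BMC + action,
                     json={"ImageURI": IMAGE, "TransferProtocol": "HTTP"},
                     auth=AUTH, verify=False)
resp.raise_for_status()

# Most BMCs return a Task resource in the Location header that can be polled
# until it reports Completed or Exception.
print("update accepted, task:", resp.headers.get("Location"))
```

The point of pairing this with PLDM support in the peripherals is that the same call then covers NICs, accelerators, and drives, not just the BMC and BIOS.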
Within the industry, there is a practice where server vendors will resell other OEMs' PCI cards under their own label. This is known as private labeling. Private-label cards generally come with a version of the OEM's firmware that has been modified by the server vendor. These modifications are generally to add hooks so that information such as temperature data, serial number, and firmware version appear in the server's BMC. Modification of the OEM's firmware takes time to develop, test, and certify. Because of this, the time for a particular version of firmware to make its way into a certified release for a private-label card can take months. This can really throw a wrench in the works if you're trying to deploy a new technology like vRAN and you come across a bug in the version of firmware on a NIC or an accelerator that was fixed by the OEM six months ago. So what are the lessons learned here? If possible, avoid private-label hardware. You may lose some functionality, like integration with the BMC, but as we already said, you should be pushing the vendors to implement PLDM for BMC-to-peripheral communications. With OEM firmware, if you do encounter a bug, you're likely to be able to get it fixed in a few days or weeks, compared to a few weeks or months with private-label firmware.

When vRAN is built out, we'll be looking at a fleet of up to 80,000 servers. To deploy and manage those nodes, we'll need a management platform, and our expectation was that we could support 2,000 nodes per management cluster. This would result in 30 to 40 management clusters across the network to manage the entire fleet of servers. Remember what I said earlier about Verizon's scale breaking things? The reality is that we're only able to get 200 to 250 nodes per cluster, which means we'll now need 300 to 400 management clusters. So what lessons did we learn here? In this case, the open source platform that the management clusters are built on is not the issue. Instead, the limiting factor was a feature that the vendor had added on top of the open source system. In this instance, it was a data collection and analytics platform that was significantly slowing down the deployment process. Removing the add-ons has significantly improved the deployment process, getting us closer to our target number of management clusters.

When you're deploying a nationwide system, you need to be able to achieve some velocity. To minimize deployment times, we developed a zero-touch provisioning (ZTP) system. So what lessons did we learn from developing a ZTP platform? Make the install document as simple as possible. We initially had a high number of first-visit failures. Our field engineering team pointed out that the install document was too long and had a lot of cross references, making the install process hard to follow. Reducing the install document down to a single page has dramatically reduced the number of first-visit install failures. Make sure the hardware you test within a lab is the same hardware deployed in the field. The router hardware used in the development of the ZTP process was newer than what was deployed in the majority of the field locations. The routers in the field only support one DHCP helper address, which was already assigned to another VLAN. OK, plan B: just enable IPv6 on the BMC and let SLAAC assign the BMC an IPv6 address. Although the BMC supports IPv6, for some reason it only supports static IPv6 address assignment. So for the short term, we reverted to manual configuration of the BMC IP address. Minimize the initial boot image. Most BMCs do not have enough memory to store a full server image, so use a minimal image to get the node up on the network. This will reduce the overall install time, as once the image boots, you'll have access to higher-speed interfaces to load the rest of the server platform. And test, test, test your ZTP process. Failures during the ZTP process can leave the server in an unknown state. Best case, you may have to start the whole process again. In the worst case, you may have to dispatch someone to repair the server.
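As an aside, even the manual BMC addressing fallback I mentioned can be scripted once the BMC is reachable at all, for example over a temporary address or the host's in-band channel. Here's a rough sketch of setting a static IPv4 address on the BMC's network interface through Redfish; the addresses and credentials are placeholders, and property support (DHCPv4, IPv4StaticAddresses) and any required headers such as If-Match vary by vendor and Redfish version.

```python
#!/usr/bin/env python3
"""Sketch: assign a static IPv4 address to the BMC's NIC via Redfish."""
import requests

BMC = "https://169.254.10.5"    # hypothetical temporary/link-local BMC address
AUTH = ("admin", "password")    # placeholder credentials
STATIC = {"Address": "10.20.30.40", "SubnetMask": "255.255.255.0",
          "Gateway": "10.20.30.1"}

s = requests.Session()
s.auth, s.verify = AUTH, False

# Find the first management controller and its first network interface.
mgr = s.get(BMC + "/redfish/v1/Managers").json()["Members"][0]["@odata.id"]
nic = s.get(BMC + mgr + "/EthernetInterfaces").json()["Members"][0]["@odata.id"]

# Disable DHCP and apply the static address; some BMCs reset the NIC afterwards,
# so expect the session to drop on the old address.
resp = s.patch(BMC + nic, json={"DHCPv4": {"DHCPEnabled": False},
                                "IPv4StaticAddresses": [STATIC]})
resp.raise_for_status()
print("BMC interface updated:", resp.status_code)
```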
Our target has been to provision new nodes in batches of 2,000 at a time. So far, our biggest obstacle in achieving 2,000 nodes in a batch has been the performance of etcd. The number of key-value pairs written into etcd for each node is enormous. Multiply that by the number of concurrent nodes being deployed, and you can see how it can spiral out of hand quickly. The second issue we found is how inefficiently the controllers go about adding nodes to the cluster. It's not uncommon for a node that is 99% complete with provisioning to wait 20 to 30 minutes for the controller to finish provisioning it. Instead of finishing nodes that are near completion, the controller will often start provisioning new nodes. This dramatically slows down the provisioning process, and it results in provisioning failures for a high number of nodes. We are working with the CNCF and our platform vendors to make changes to the provisioning process. Also, transport latency between the controller and the nodes can dramatically increase the time it takes for a node to complete provisioning. What takes a minute or two in the lab ends up taking five or 10 or even 15 minutes in the field. So don't underestimate the impact of latency.

The vRAN vDU runs a real-time operating system, or RTOS. An RTOS is needed so that the vDU can complete all the tasks needed to keep track of and process data coming from or being sent to the user device within a specified amount of time. If you don't complete all the necessary tasks in the time allotted, you start dropping frames. Frame loss can impact the user experience and, if it's high enough, may force the device to re-register with the network. So what are the lessons learned here? Always check the priorities of the application processes versus the kernel. In our case, the developer had assigned the vDU processes the highest priority, which made them run at a higher priority level than the kernel. This meant the kernel was being starved, which eventually resulted in a reboot of the node. Changing the application priority on the vDU to allow the kernel to run necessary cleanup tasks resolved the reboot issue.

A few slides ago, I was talking about etcd being a bottleneck in our node deployment process. One of the key concerns was the amount of data written to etcd. While trying to find the bottleneck in the provisioning process, we learned a few lessons. The first one is the etcd keyspace quota size. Now, you might not run into this unless you're provisioning thousands of nodes, namespaces, or pods at a time, but running etcd out of disk space is bad. Ask me how I know. A preventive maintenance routine to clear out old or unused data and then defragment etcd will keep it happy. Also, give etcd the highest-performance storage you can. While conducting tests on large deployments, we saw a lot of pressure on the local storage subsystem. This eventually led us to using NVMe drives to improve write performance for etcd. Which brings me to the last issue: instrument etcd and monitor its performance. Pay particular attention to disk performance, and when you see it drop, run the preventive maintenance routines.

While we're still on the subject of etcd, resist the temptation to play with it. etcd is the heart of your cluster, so don't mess with it unless you know what you're doing. Making changes directly to etcd can cause nodes to be ejected from the cluster, deployments to be erased, and containers to stop running. You can make the entire cluster unstable and therefore unusable. I'll say it again: unless you know what you're doing, don't meddle with etcd. Okay, now that you've been warned, if you still want to make changes to etcd, make a backup and be prepared to rebuild the cluster. Seek help from someone who is intimately familiar with etcd and its use in Kubernetes, and finally, test, test, test in a lab before you push your changes into a production environment.
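To give a flavor of what that preventive maintenance routine can look like, here is a minimal sketch that compacts old etcd revisions, defragments the member, and clears a NOSPACE alarm if the keyspace quota was hit. It assumes etcdctl v3 is on the PATH with endpoints and certificates supplied through the usual ETCDCTL_* environment variables; in production you'd take a snapshot first and defragment one member at a time.

```python
#!/usr/bin/env python3
"""Sketch: compact, defragment, and disarm NOSPACE alarms on an etcd member."""
import json
import subprocess

def etcdctl_json(*args):
    """Run an etcdctl command and parse its JSON output."""
    out = subprocess.run(["etcdctl", *args, "--write-out=json"],
                         check=True, capture_output=True, text=True).stdout
    return json.loads(out)

# Current revision of the first endpoint.
status = etcdctl_json("endpoint", "status")
revision = status[0]["Status"]["header"]["revision"]
print("compacting history up to revision", revision)

# Drop old revision history, then reclaim the freed space on disk.
subprocess.run(["etcdctl", "compaction", str(revision)], check=True)
subprocess.run(["etcdctl", "defrag"], check=True)

# If the keyspace had hit its quota, clear the NOSPACE alarm so writes resume.
alarms = subprocess.run(["etcdctl", "alarm", "list"],
                        check=True, capture_output=True, text=True).stdout
if "NOSPACE" in alarms:
    subprocess.run(["etcdctl", "alarm", "disarm"], check=True)
    print("NOSPACE alarm disarmed")
```

Tie a routine like this to the disk-performance metrics you're already collecting, and you can run it before etcd gets into trouble rather than after.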
For CNF deployments, we generally see 20% less CPU and 30% less memory utilization than what was requested in the manifest. This is obviously inefficient and wastes resources, and to combat that, we started asking applications to scale back the size of their initial deployments and use autoscaling to address increases in network load. So what are the lessons we've learned here? Understand how autoscaling works with your application. How does your app handle resource deprivation, and how fast can additional instances of your application spin up? If you're using vertical autoscaling, understand that the VPA does not update resource configurations for existing pods. Instead, it kills the existing pods and recreates them with the updated resource requirements. This can have the effect of decreasing capacity before increasing capacity, but your mileage may vary. Don't use the same metric to trigger the HPA and the VPA. Doing so will create conflicts within the cluster while the HPA and VPA attempt to resolve the scaling request. If you are using cluster autoscaling, take into account how long it takes for additional nodes to spin up and determine whether your application can survive the resource deprivation while additional nodes are spun up. If you can't wait that long, consider increasing your resource request to give yourself some relief. Managing resource utilization is a never-ending process, so don't expect this to be a one-and-done for your applications.

Everyone wants to add a label. Labels for storage classes and labels for node capabilities are the most common, but recently we've seen an increase in the number of labels that CNFs are requesting. The issue here is that a lot of these labels are three- or four-letter acronyms, and we've already had a couple of label collisions within a cluster. The lesson here is to define a labeling scheme for your platform, publish it, and use it.

Kubernetes is very dependent on DNS. So much so that a disruption in DNS services can shut down a cluster. We recently encountered this problem for the first time. At first we thought we had a routing issue, but further investigation pointed to DNS. The root problem was that the DNS instances were not able to keep up with the amount of traffic being generated as new CNFs were being deployed. Further investigation uncovered two issues: CNF deployments were pre-provisioning DNS entries for pods that didn't yet exist, and the service mesh was not honoring the TTL value supplied by the DNS. So basically, we DDoSed our own DNS implementation. So what are the lessons learned here? Don't create DNS entries before you need them. Validate that all services honor the TTL supplied by DNS, and understand your DNS infrastructure capacity.

I love using Redfish, but I do see opportunities for improvement. In our environment we have several different server vendors, and the latitude in the current Redfish specification means we have to deal with differences in how the vendors identify the system instance in their servers. For example, "1" for vendor A, "Self" for vendor B, and "System.Embedded.1" for vendor C. This is maddening when you're trying to write automation to manage a mixed fleet of servers. The lesson here is that if you deal with a mixed fleet of server hardware, you'll have to find ways to deal with the shortcomings of the Redfish specification. One option is to add code to your automation and orchestration that determines which vendor's BMC you're talking to and then adjusts the API syntax accordingly.
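One way to sidestep the naming differences in practice is to walk the Systems collection and use whatever member names the BMC reports, rather than hardcoding them. Here's a minimal sketch of that approach; the BMC addresses and credentials are placeholders.

```python
#!/usr/bin/env python3
"""Sketch: discover each server's system instance path from the Redfish
Systems collection instead of hardcoding "1", "Self", or "System.Embedded.1"."""
import requests

FLEET = ["https://192.0.2.10", "https://192.0.2.11"]   # hypothetical BMC addresses
AUTH = ("admin", "password")                           # placeholder credentials

def system_paths(bmc):
    """Return the @odata.id of every ComputerSystem the BMC exposes."""
    r = requests.get(bmc + "/redfish/v1/Systems", auth=AUTH, verify=False, timeout=30)
    r.raise_for_status()
    return [m["@odata.id"] for m in r.json().get("Members", [])]

for bmc in FLEET:
    for path in system_paths(bmc):
        system = requests.get(bmc + path, auth=AUTH, verify=False).json()
        # The same code now works regardless of what the vendor named the member.
        print(bmc, path, system.get("Model"), system.get("PowerState"))
```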
You could also group your servers and then write code specific to each group. But this all seems a bit fiddly to me, which is why we're pushing our server vendors to adopt "System" as the instance name in their implementations of Redfish.

There seems to be a growing trend in Kubernetes environments to put more and more of the monitoring tools and systems in the cluster. This leads to a scenario where the monitoring infrastructure is running on the platform it's supposed to be monitoring. From my own experience, this is a bad idea. When the monitoring system is running inside the system it's supposed to monitor, you can often end up with some really weird edge cases. A good example of this was a recent event where a CSI issue prevented containers from writing to their persistent volumes. Because the PVCs couldn't be written to, the container responsible for sending alerts wasn't able to pick up the message that indicated that the CSI was experiencing issues. This resulted in us moving all monitoring, alerting, and eventing off the production clusters and onto their own cluster. We've also implemented heartbeats in the telemetry stream, so now we're able to determine whether we're still listening to a live system. Since we did that, we have had better continuity for monitoring. Does this cost more? Yes. Does it add more traffic to the network? Absolutely. But it gives our ops and engineering teams better peace of mind.

We've reached the end of my presentation, and if you stayed with it this long, I'd like to say thank you. If there are any questions, please feel free to post them in the Q&A now and I'll try to answer them. Or feel free to reach out to me via my contact information at the front of the deck. Thanks, and I hope you enjoy the rest of the conference.