Okay, this is a slightly extended version of a presentation I gave just one hour ago in 15 minutes. This time it is going to be 30. So I'm going to talk about NFVI upgrades and migrations with critical telco workloads. Thank you.
So the key takeaway message I want to deliver to you today is this. One achievement is to develop and provision a full telco cloud solution in service. That requires development, integration, testing, verification, service delivery and acceptance. Once you do that on the infra, then you have to onboard all the applications. That's quite a challenging, long procedure that eventually results in service, and cheering and applause. But another major achievement is to actually maintain this cloud solution in service during its whole lifespan, and that means keeping up the software lifecycle management of the cloud: basically, NFVI upgrades.
In Ericsson, we have a very old concept from the 70s called ISP, in-service performance. This concept existed even before the concept of internet service provider, but so far it's not very popular outside, only between Ericsson and its customers. ISP basically means: how is my telco equipment performing, and how can I secure that it is permanently performing, so there is permanent uptime? Normally people relate ISP to 99.999 percent availability, the famous five nines. Back in the 70s, there were three main industries that could deliver five nines of quality: nuclear plants, aerospace and good old telco switches for normal call making.
Anyway, my name is Gerardo Martinez.
I am the NFVI CI/CD lead solution architect. There is a program in Ericsson called the NFV program, where we are basically a very dynamic team that takes on the challenging job of combining the Ericsson infrastructure with the Ericsson applications, putting them all together and making sure that applications and infra are gluing as expected. In our program, we don't deliver a particular solution. We just make sure that the applications and the infra meet the requirements and the expectations, so that later other areas in Ericsson can develop the 5G core solution, IMS, etc. The team is spread all over the world, across all the Ericsson R&D sites. My day starts very early in the morning, getting in touch with Chinese colleagues, other Asian colleagues and Indian colleagues, and it ends with people in North America.
But basically we can put all this in an equation called NFV. NFV is just a combination of NFVI; VNFs, the so-called OpenStack VM-based applications; CNFs, the containerized ones; and LCM, life cycle management. Some people also like to refer to LCM as MANO and include the orchestration part, so you might hear about NFV MANO as a process that involves both virtualization technology and orchestration.
What do we do? We basically test with the latest Ericsson NFVI software before it's released to our customers. We perform LCM activities, so basically we do a full NFVI upgrade of the infrastructure with applications on top running traffic. After that we perform resilience tests, the typical resilience tests that every customer would like to see: what happens if I reboot this compute, what happens if I reboot this switch, and things like that. After all these resilience tests are done and the upgrade is done, we give our recommendation to the release process of the NFVI software solution, and the software is released and reaches what we call GA, general availability. Then the solution is available to our customers.
What are the forces behind upgrades?
Why do we upgrade? What are the triggers? The first reason is that we want a new feature. There is a feature that we cannot get in the current software, so we upgrade our system in order to get certain features, a certain capability. The second reason, widely accepted today, is that we want to keep up to date with security fixes. We all know that, from here into the future, new vulnerabilities will always be discovered, and it's now almost a routine that every three months we just have to do security fixes. Then there are bug fixes, which are probably not the favorite of our customers. Our customers would not like to upgrade a system just because there is a fault to be fixed, but we also try to put the fixes there. And finally, life cycle and support, which means the cloud is running software so old that there is no business in maintaining it, and therefore the vendors say: we can support you until this date, and after this date we cannot keep maintaining the software and providing support. Then we convince our customers, and they decide to go for the upgrade.
So what are our customers' expectations? First of all, they expect a carefully verified upgrade procedure. They really expect that what we are about to do is written, prepared and very clearly explained to them, because they are running live traffic; all your phones' data are most likely connected to this live telco network. They expect downtime management: you tell them this is the amount of time that the system will be out of service, and you have to deliver on that promise. And controlled risk: if something doesn't go as expected, then customers expect that you go back to a safe point.
You cancel the activity, you postpone it, and you try to understand what went wrong.
Sometimes upgrades, especially in cloud, could be either not possible (I was just before in a talk about how obsolete your cloud can be) or too disruptive: the downtime is so major that it's better not to perform the upgrade with live traffic. Instead you reroute the traffic, you do traffic offload, you try to move the traffic somewhere else.
And then we also fall into a famous wording situation: what exactly is an upgrade, and why don't we call it an update? There is this famous story where somebody reports, "I have upgraded my Kubernetes cluster to 1.23." Then you ask, "Okay, so you did this rolling upgrade procedure with the application inside?" and they might say, "No, no, I just deleted the cluster, installed a higher version and restored everything again." Well, in telco we don't call that an upgrade. We call that a reinstallation, or I prefer to call it an uplift, meaning you are bringing your cluster, your OpenStack, to a higher software level, but you are not really doing a live upgrade.
Having said that, let me show you a bit of what the Ericsson solution more or less looks like, just trying to point out the generic open source protocols we support. Overall, the Ericsson NFVI solution is composed of hardware and switching fabric. We meet all these standards: IEEE, RFCs, and the famous protocol IPMI, which is not a standard as such but was strongly driven by the industry in the early 2000s, plus Rack Scale Design and Redfish. Then you have the virtualization layer: OpenStack, OpenDaylight, Open vSwitch. And then on top of that you have the containerization layer. I guess you have heard it in the summit.
They use this terminology, LOKI: Linux, OpenStack, Kubernetes Infrastructure. Well, it's another way to represent what in the end is normal practice in the industry. Still, we are also trying to deliver solutions where we try not to have the OpenStack layer in between. That is something we are also working on. And there we have the CNFs and the VNFs. Behind each of these boxes, in our program, there is a team of two or three people sitting somewhere in the world who are experts on that box. And that's a very cool thing about this program: it's a crossroads of many cultures, many countries, many R&D centers under the same company culture.
Let me talk now about some telco industry facts and practices. Why is the telco industry so special? First of all, regulation. The telco industry has been a regulated industry since the very beginning. You cannot become a telco operator from your garage. You cannot build a start-up and call it a telco operator; that would be quite difficult. You normally need to go to the government, you need to bid in an auction for frequency bands, or get the right to dig a hole in the street to lay a fiber, etc. It's a complicated process.
So therefore there is regulation. And there is a very well-known factor linked to the telco industry, which is the emergency services, the so-called 911 in America, 112 in Europe. Emergency services have binding legal commitments, meaning you cannot bring down that service for any reason whatsoever. There have been stories in the news that, when this service is down in some countries, the country deploys the police into the streets just to listen for whether some emergency is happening. So this is a very serious thing that can never be out of service.
Standardization. This is quite different from the way open source works. In the standardization world, the standards are built first and the products are implemented later. So there is a big discussion about building the standard, the rules of the game, and once that discussion is agreed, everybody starts the race to build. This is still very important today on the low layers, layers one, two and three. For optical fibers, for example, you have to specify even the characteristics of the connectors, etc. Again, it's not something you can decide on your own. And we have a very long tradition of thinking in that way. Open source is something we cannot call new anymore; we have been adopting open source for more than 10 years now, but it's not where we come from.
There used to be these telco generic requirements. In the 1980s, when the Bell companies split in the US, there were the so-called Baby Bells, and there was a core that was split into Bell Labs and Bellcore. Nobody talks about Bellcore too much, but Bellcore is the one that tells you what the rules of the game should be. Then Bellcore became a company called Telcordia, and they have these famous GRs, generic requirements, which are what define whether a product is telco grade or not. Those requirements still exist. Interestingly, Telcordia was acquired by Ericsson in 2012.
It became part of Ericsson, and today Telcordia has evolved into what we could call network operation automation. So the whole MANO industry comes from Telcordia.
Among all these generic requirements there is a famous concept that we also like a lot in telco, which is called the first office application concept, the FOA. FOA means the first time in the world that an operator, trusting its relationship with the vendor, takes a product into production. The product is not yet officially released; it's only released to this particular customer. Then the product goes live, and after the product goes live something very interesting happens: all the tier one, tier two and tier three operators start to talk to the FOA customer and get feedback about that product, and that feedback will be very important later in defining how good your product is. That is something very important in the industry. Once this is done and the FOA is successful, we can declare GA. This is more or less how telco has been working all the way.
Finally, telco operators usually cover a geographical area. Contrary to IT, where in some cases you go worldwide almost immediately and the coverage can be very vast, telco operators come from the fact that they cover a geographic area. So when you want to do upgrades, you have to do them when the traffic is low, and then the concept of the maintenance window comes into play: the night, when people are sleeping, and weekends, moments in time when you can take the risk of performing an upgrade. Upgrades therefore happen during low-traffic periods, nights and weekends, but they don't happen all the time. They are very important steps for customers today.
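To make the maintenance window idea concrete, here is a minimal Python sketch of the kind of guard an upgrade tool could run before starting a risky step. The window times and the function name `in_maintenance_window` are assumptions made up for illustration; they do not come from any Ericsson tooling, and a real operator would agree its own window per site.

```python
from datetime import datetime, time

# Hypothetical nightly window, 01:00 to 05:00 local time (assumed values).
WINDOW_START = time(1, 0)
WINDOW_END = time(5, 0)


def in_maintenance_window(now: datetime,
                          start: time = WINDOW_START,
                          end: time = WINDOW_END) -> bool:
    """Return True if `now` falls inside the daily window [start, end)."""
    t = now.time()
    if start <= end:
        # Window lies within a single calendar day, e.g. 01:00 to 05:00.
        return start <= t < end
    # Window wraps past midnight, e.g. 23:00 to 04:00.
    return t >= start or t < end
```

A caller would check `in_maintenance_window(datetime.now())` before each disruptive step and postpone the step when it returns False.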
So we are not there yet. There will be a moment when CI/CD is adopted by the telco industry, and at that moment we could assume that upgrades happen all the time, but today that is not the case.
Having said that, we can see a summary of the steps I just described. You prepare for a few days, you do some offline work, you do some diagnostics in the data center. Then the night of the upgrade comes: you do the activity and the post checks. If things go okay, you confirm and complete the upgrade; if something does not go as expected, you roll back. Let's assume the upgrade is successful: there is offline work again, post monitoring, alarm checks, etc. The site can never be left broken in a change activity. You cannot just leave it broken and go. Either the change completes successfully or you have to roll back to a state that is safe, sometimes within the same night. And the matter of fact today is that upgrading a data center in one night is literally impossible.
And then we fall into the challenges. We get many questions from our customers: how big should our data center be? Should we have one large data center or many small ones? One criterion for deciding how big your data center should be is how long you are prepared to take for the upgrade, for the software maintenance. The larger the data center, the longer the upgrade will take. And if we consider that you cannot upgrade every night, only some nights like the weekend, you might end up taking months to upgrade a data center. So you have to be very mindful about the size of the data center; too big could result in very long upgrades.
Multi-tenancy challenges. Our customers want to exploit every last core available on the computes. Therefore they like to deploy all these VMs and Kubernetes clusters in a static way. We are not like the public cloud, where you can just create a cluster with a button.
Normally all of these are deployed fixed. But then you fall into the so-called Tetris game. We call this the Tetris, meaning that some computes might host VMs from one VNF and from another VNF, and what happens if you bring that compute down? You might fall into infinite combinations of consequences. So you have to know very well what you want to put together, because when that compute is rebooted you might affect two VNFs, which might end up affecting what we call the call path, the traffic flow.
Another important thing is backups. A backup that doesn't restore cannot be called a backup. The main purpose of a backup is to use it in case of emergency. In cloud technology, that backup can sometimes be the image of a VM, which could be a very big file, or databases. Sometimes exporting and importing those backups could take longer than just redeploying a system. And then we fall into something which is very sensitive for our customers: if we have to roll back, sometimes it's better to redeploy than to restore from the backup. In telco this is a very sensitive issue, because redeployment sounds scary at first. But with the proper explanations you can show the customer that redeploying a stateless cluster might be faster than restoring the backup. Still, we do backups; you never know what the best way will be in the end. But that's another thing to keep in mind.
Traffic resilience. This is a good one. Sometimes you discover resilience issues in your deployment when you do the upgrade, because a golden rule is: if your system is resilient, then your upgrades will be successful. An upgrade is about resilience. It's about cutting the resilience of your system so you can upgrade the passive side while the active side stays in service, and then you switch, etc.
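The passive-first order described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` class, its `healthy` flag and the `upgrade_pair` function are hypothetical names, and a real upgrade would call the platform's own tooling at each step instead of just setting a field.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    role: str      # "active" or "passive"
    version: str
    healthy: bool = True


def upgrade_pair(active: Node, passive: Node, target: str) -> List[str]:
    """Upgrade a redundant pair passive-first and return the steps taken."""
    # Resilience precheck: if redundancy is already broken, do not start.
    if not (active.healthy and passive.healthy):
        raise RuntimeError("pair is not resilient, aborting before any change")

    steps = []
    # 1. Upgrade the passive side while the active side carries traffic.
    passive.version = target
    steps.append(f"upgrade {passive.name}")
    # 2. Switch over so the upgraded node becomes active.
    active.role, passive.role = "passive", "active"
    steps.append(f"switch traffic to {passive.name}")
    # 3. Upgrade the former active side, which is now passive.
    active.version = target
    steps.append(f"upgrade {active.name}")
    return steps
```

The precheck is the point of the sketch: refusing to start on a pair that has already lost redundancy reports a resilience problem as what it is, before any change is made, instead of letting it surface halfway through as an apparent upgrade failure.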
So quite often it happens that during the upgrade you discover resilience issues that were present before the upgrade, but because you happen to be doing the upgrade, you exposed them. The impression is that the upgrade is failing, and then you find out later in the post-mortem that no, the upgrade was doing what it's supposed to do, but the system was not as resilient as expected. Some configuration issue, it could be anything.
Another interesting one: migrations, or what we call in Kubernetes pod evictions. Let's say I want to upgrade this host. You have the possibility to bring down the VM and leave it on the host, sleeping until the host is back, or you can migrate the VM and leave the compute empty for the upgrade. Here I always like to make an analogy. OpenStack VMs, to me, are like football. You can imagine the VM as a soccer ball: everybody knows where it is, everybody knows the rules of the game. Normally in a VNF you can keep in your mind the number of different VMs that form that VNF. So it's like a soccer match, very simple to watch. You know where the ball is, you know what the purpose of the game is. With containers I make a different analogy, without getting into which is more popular, soccer or pool. But with containers, it is like playing pool.
It's like you take the stick, you hit the white ball, then the white ball hits the other 10 balls, and you have six holes. All these containers just move around when you do the eviction, and there you have to be very careful, because in most cases what you do is a drain, and you can fall into a situation where one container doesn't want to leave the cluster.
That is connected to the next topic: pod disruption budget. This is something that is very important in telco applications. Every application has to do the right homework on pod disruption budgets, because not doing this properly could result in a container saying, "No, I don't want to leave the worker. I am too important to leave. I need to stay in service." And then you can fall into a situation where your upgrade is hanging because that container doesn't want to leave. So this concept is very important. Please don't forget it: pod disruption budget.
What else? Reboots, power off and power on. A very famous telco requirement is that equipment always has to recover by itself after a sudden loss of power. This requirement comes from the fact that a lot of telco equipment is spread across very harsh geographic locations, and the expectation is that if you get a power loss and the power comes back, the system should be back without manual intervention.
That is a very tough telco requirement to meet. And then you have these reboots, and customers do not like reboots. Reboots mean time, losing time waiting. Here we always have to make sure that we try to align all the reboots. If we're going to upgrade a compute, we try to keep the number of reboots to the minimum possible, because the customer doesn't want to hear that you are rebooting a compute three or four times during an upgrade. That is not efficient.
Furthermore, I will explain a bit of what I call the advanced level. For example, elasticity versus rollback. When you start to upgrade a cloud, your VMs or your containers start to move around. Therefore it's inefficient, and I would say very difficult or even more painful, to put things back the way they were before than to just leave them as they are. So rollback and elasticity involve a bit of a compromise. It's very important to convince the customer that the VMs are going to move, and that if we decide to stop here for the night, there is nothing wrong with the VMs sitting in a new place as long as the service is there.
Again, downgrade versus redeployment. Downgrade could also be related to rollback, but in general redeployment is something that needs a lot of careful explanation to our customers: why we have to redeploy, why we have to
go back to the previous software. In the IT industry, sometimes you might find products that do not support rollback, that you really have to leave as they are. In telco this is very hard to digest, but that's the case.
Hardware interdependencies. Something we have also noticed is that there is this area of uncertainty between NIC firmware and Linux kernel drivers. Sometimes you might be surprised that one depends on the other in a different way than you expect. So sometimes you might have to upgrade the firmware before the driver, and sometimes the driver before the firmware. There you have to be careful and read very well why you are upgrading the firmware and why you are upgrading the driver.
And finally, NFVI migrations driven by traffic rerouting. Sometimes the whole process of steering traffic into the NFVI does not happen on the NFVI but in the IP network, in the routers. So the process of offloading traffic occurs in a different layer than the data center, and telco customers have issues accepting this, because the traffic department is normally a different department from the one that is managing the NFVIs. When you do an operation in one layer, you don't want to depend on another layer for which you have to bring in the experts. And the same happens with OpenStack upgrades and the application: it is very difficult, for one night, to have the application expert, the infra expert and the routing expert all sitting together, waiting to see if there is an issue. So this is very important to consider.
Finally, I would like to close with the fact that we did a successful upgrade, and with what happens when your upgrade
goes through and is successful. I can tell you from personal experience: after upgrading the same lab every three months for two years, you basically end up attached to the lab. You feel that the lab is your pet. So I would like to ask you: how many of you know the famous story of pets versus cattle? Yeah, it's a very famous story that you should know, especially because cloud tells you that you have to treat the servers as cattle, not as pets. You cannot fall in love with a compute. That is not good, because then the compute becomes untouchable. But now let's talk at a higher level. You have a data center with 100 computes. Let's assume that every compute is cattle, but you fall in love with the data center, and the data center becomes your pet. You know its story, you saw it grow, you know its strengths, you remember the compute that failed, and you know which leaf or spine gives you more trouble, etc. So you treat it as a pet.
But we are an industry and we have to deliver cattle. So, procedures first: they have to work equally on every data center. All data centers must be fully predictable, although the upgrade procedure will never be a unique procedure, because every data center is in a way different. You are expected to provide something that will behave the same everywhere. And they also grow and develop. So how do you solve this problem? How do you deliver cattle and pet at the same time? I'm trying to solve this problem today, and my answer is: horse. Why horse?
No pet, no cattle: horse. Well, it's maybe a nice way to end the presentation. You cannot have a horse at home. Luckily for our vegetarian colleagues, horses are not as popular as a meal. And horses can be very well trained animals. I recently got an impression of how well they can be trained, and this is what we expect from a cloud: to behave as a horse.
So finally I would like to thank Bread co-food, who could not make it to the summit but helped me to build this material. You can reach me on LinkedIn if you want to talk to me. My pleasure. Thank you.
Any questions? Ah, maybe we should have a question session. Yes. Sorry.
Well, yes, so the question is how to mitigate some of these challenges. The first thing that comes to my mind is that, although a LOKI or NFVI system is composed of multiple layers, we would like those layers to have a certain intelligence to communicate with each other during the process. You can make a vertical cut and say: one compute host is connected to a switch, and it has OpenStack, Kubernetes and an application on top. Ideally, you would like all these layers to be aware of what is going on, so that the layers can prepare better for this resiliency scenario. For example, reboot alignment is one way to mitigate this: you try to make sure everybody takes the new software, but you hold the reboot as long as possible, so that you do only one reboot.
Okay, upgrade time. This is always a good one. I get asked it all the time.
How long is this upgrade going to last? The answer is that the upgrade time doesn't fully depend on the infra. It depends on what you have on top and on its capability to survive the resilience scenario, because if you do a parallel upgrade of all the computes at the same time, you might finish very quickly, but then you lose the service. So it's very hard to measure duration, because you really need to know the data center and then come up with a diagnosis. What we want to do is this: we don't want to say it takes this amount of minutes, hours or days. We want to talk in maintenance windows, and that's why I brought the subject into the presentation. The ideal situation is first to ask the customer: when is your maintenance window? How many hours is it? Because it's not the same in a big city as in a small town. Then, with a proper measure of the maintenance window, you can say: okay, I can deliver this upgrade to you in this number of maintenance windows. So in the end there is no easy answer.
Edge. Edge clouds are probably going to have faster upgrades, because we always expect the edge to be smaller. But yeah, we have so much in the portfolio, and we also have Cloud RAN happening as we speak, because what is expected in the future is that any base station will be like a mini data center, right?
So, any other questions? We have run out of time. I guess we are five minutes late. So thank you very much. Enjoy the show.
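As a footnote to the upgrade-time answer, the "how many maintenance windows" arithmetic can be sketched in Python. Everything here is an assumption for illustration: the function name `windows_needed`, the idea that computes are upgraded in equal batches, and the numbers in the usage note. A real estimate needs the diagnosis of the actual data center, as said above.

```python
import math


def windows_needed(computes: int,
                   max_parallel: int,
                   minutes_per_compute: int,
                   window_minutes: int) -> int:
    """Estimate how many maintenance windows a rolling upgrade takes.

    max_parallel is how many computes the workloads on top can lose at
    the same time without losing service (an assumed, per-site figure).
    """
    if min(computes, max_parallel, minutes_per_compute, window_minutes) <= 0:
        raise ValueError("all inputs must be positive")
    if minutes_per_compute > window_minutes:
        raise ValueError("a single compute upgrade does not fit in one window")

    # Computes are upgraded in batches of at most max_parallel.
    batches = math.ceil(computes / max_parallel)
    # Only whole batches can be completed inside one window.
    batches_per_window = window_minutes // minutes_per_compute
    return math.ceil(batches / batches_per_window)
```

With hypothetical figures of 100 computes, 2 computes at a time, 45 minutes per compute and a 4-hour window, `windows_needed(100, 2, 45, 240)` returns 10. Upgrading all 100 in parallel would fit in a single window, but that is exactly the case where you lose the service.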