Good morning, good afternoon, and good evening, team. It's an absolute pleasure to be at the conference and talk about closed-loop automation at the network edge. It is a very critical and pertinent topic for the telcos. And joining me today are two of my distinguished colleagues: Praveen Jayachandran, a senior technical staff member at IBM Research, and Matthew Thomas, a distinguished engineer in IBM's industry markets team for telecommunications. Moving on to the next slide. The topic of closed-loop automation is critical as carriers look for ways to transform their network operations. As the carriers transform their network operations using technologies like AI, analytics, and data-driven methods, the key is to make sure that there's a proactive way of ensuring that the network continues to function without any disruption. Closed-loop automation handles this in three parts: detect anomalies, proactively look for a resolution, and implement that resolution without the system ever going into a disruption mode. In terms of the benefits, at a very high level, they again fall into three categories. First, cost savings: significant savings in terms of hours saved in network resolution, and therefore dollar savings. Second, talent: moving critical resources away from this work to other significant work, because this can now be automated and run seamlessly. And last but not least, in fact the most critical, customer experience: ensuring the end consumer is satisfied, clearly measured through NPS ratings.

Hi, folks. Appreciate you joining this session with us. I'll be covering two points. The first is: where does closed-loop automation fit into the big picture of network deployment, management, and operations? The second area I'll be covering is some use cases relevant to network operation and management using closed-loop automation. So let's look at the first piece. When you look at the overall operation and management of a 5G network, the following are some of the key steps one has to go through. One will obtain the appropriate and relevant xNFs. These xNFs will go through the CI/CD process and be placed in a catalog. Then services will be created around the different xNFs that are available. An example of a service would be, say, creating a slice, a slice that goes all the way through the core, the RAN, and the transport layer, and then that service will be made available in a catalog for external consumption. Once that service is made available in a catalog for external consumption, someone could come and actually place an order for that service. And in the process of ordering that service, the appropriate components will be deployed, and the slice will be created across the different layers. The next step is where closed-loop automation comes in. As the slice is operating, we are constantly getting information from the network. That information is analyzed to determine if there are any issues currently going on, or to predict any potential issues that could happen. When it is detected that such an issue could happen or is happening, we do a root cause analysis to determine what the issue is, and then automate the process of correcting it. In order to correct it, we would have to call our back-end systems, our MANO systems or other back-end systems that are needed to actually fix it, and then the network should continue running. While the network is running, the same process repeats.
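As a rough illustration of that cycle, the sketch below shows what a minimal detect, root-cause, and remediate loop could look like. Every name in it (collect_telemetry, find_root_cause, the 300 ms registration-time threshold, the scale-out action) is a hypothetical placeholder for the kind of logic described above, not part of any IBM or 3GPP interface.

import time

POLL_INTERVAL_SECONDS = 60  # how often the loop re-evaluates the running slice


def collect_telemetry():
    """Placeholder: pull current metrics and events for the running slice."""
    return {"slice_id": "slice-001", "ue_registration_time_ms": 480.0}


def detect_anomalies(telemetry):
    """Placeholder: flag metrics that look abnormal (illustrative fixed threshold)."""
    anomalies = []
    if telemetry["ue_registration_time_ms"] > 300.0:
        anomalies.append("ue_registration_time_high")
    return anomalies


def find_root_cause(anomalies):
    """Placeholder: map a set of anomalies to a likely root cause."""
    if "ue_registration_time_high" in anomalies:
        return "amf_overloaded"
    return None


def invoke_orchestrator(action, slice_id):
    """Placeholder: call the MANO/orchestration layer to apply the fix."""
    print(f"Requesting '{action}' for {slice_id} from the orchestrator")


def closed_loop():
    """Detect, root-cause, remediate, repeated while the slice is running."""
    while True:
        telemetry = collect_telemetry()
        anomalies = detect_anomalies(telemetry)
        if anomalies and find_root_cause(anomalies) == "amf_overloaded":
            invoke_orchestrator("scale_out_amf", telemetry["slice_id"])
        time.sleep(POLL_INTERVAL_SECONDS)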
Even in the process of rectifying an issue, if we get new information, our resolution will be different. Let's take a look at this again, but specifically from the closed-loop automation perspective, and look at the different components involved here. This is a blow-up of the right-hand side of the picture you saw previously. What you see here is that we have our AI systems and our orchestration. It's constantly getting data from our cloud-native environment, our containers, but we also have to be prepared to deal with VNFs and with PNFs. It's getting information from our infrastructure, whether it be native Kubernetes, OpenShift, take your pick, and it's also getting information about the network itself. This data is then analyzed by our AI system to do the root cause analysis. Once that is done, a remediation step is taken, and that remediation step is possible because our systems have been trained on trouble tickets, vendor manuals, and other information to determine how to fix a specific issue. It will then invoke the orchestrator, in most cases, to actually make the appropriate correction to the network, or it could be other systems, depending on the specific issue being addressed. Let's go to the next page and look at some of the key use cases that are relevant. There are many use cases that we're currently looking at and implementing. I'll provide a high-level overview of traffic flow optimization, and Praveen will then do a detailed discussion of slice assurance for the 5G core. But just to briefly touch on some of the others: we see security violations as a great area for closed-loop automation, where we're able to detect, say, that someone has done something that is not within your policy for a specific container, use AI to determine what that specific issue is and what the remediation is, and actually implement the remediation. Take any of the components of the 5G network, whether it be the RAN, the core, the transport layer, or anything else: a lot of issues can happen there, and closed-loop automation is ideal for addressing them. Closed-loop automation is also very good for addressing issues on specific xNFs, containers, and PNFs, where we need to decide whether something should be auto-scaled or moved; that requires looking at all the data being generated about that specific xNF, making the decisions, and actually implementing them. We're also co-leading a study with TM Forum that involves multiple CSPs and vendors. There are more use cases there, and we're developing a reference architecture. So this is a field that is very pertinent to the CSPs, and many are looking at this very carefully right now, because the management and operation of a 5G network is quite complex, and the more AI and automation we have in it, the more effective the system will be. Let's now look at just one use case at a high level. The general operational model for a 5G network will often look like this: you've got multiple edge devices, which are connected to the network edge, where you'd have your MEC components and the vDU; then there's the core network; and finally, at the back end, you have your BSS and OSS systems. What the traffic flow optimization use case does is optimize the traffic across your transport layer.
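Before getting into how the use case works, here is a minimal sketch of the kind of telemetry pull that feeds systems like this, assuming the cluster metrics are exposed through a Prometheus endpoint. The URL, the namespace, and the choice of the cAdvisor container_cpu_usage_seconds_total metric are illustrative, not part of any specific IBM product interface.

import requests

# Illustrative endpoint; in practice this would be the Prometheus instance
# aggregating cluster, infrastructure, and xNF metrics.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"


def query_container_cpu(namespace):
    """Pull per-pod CPU usage (a standard cAdvisor metric) for one namespace."""
    promql = (
        f'rate(container_cpu_usage_seconds_total{{namespace="{namespace}"}}[5m])'
    )
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql}, timeout=10
    )
    resp.raise_for_status()
    samples = resp.json()["data"]["result"]
    # Each sample carries its label set and a (timestamp, value) pair.
    return {s["metric"].get("pod", "unknown"): float(s["value"][1]) for s in samples}


if __name__ == "__main__":
    for pod, cpu in query_container_cpu("5g-core-slice-a").items():
        print(f"{pod}: {cpu:.3f} cores")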
How does it do that? Well, it's constantly getting data from all the different sources, whether it be the MEC, the transport layer, or a variety of other sources. As it gets that data, it constantly monitors it and predicts potential issues that could happen. If an anomaly is detected, or a prediction is made that an issue could happen, it starts doing root cause analysis to determine what precisely is causing the problem. All of this uses a lot of AI and analytics. Once the root cause is identified, we have a system that's already trained on how to fix certain types of issues, because it's been fed manuals and other data relevant to different issues. If this issue falls into that category, it can, using robotic process automation, directly invoke our orchestration layer or any other back-end systems that are needed, and the orchestration layer will then go ahead and fix the issue. If you look at the slide, within the SDN controller everything is initially running well, and with our predictive insights tools we can detect, given the data coming through, that a potential issue could happen and that, if it continues, there could be a serious degradation in the network. A root cause analysis is done to determine exactly what that issue is and why it's happening. Once the issue is identified, the automation system decides what the specific resolution is; in this case it was to make some specific changes to the SDN controller to resolve the issue. The change is made, and you will notice that the network that was beginning to degrade is no longer degrading. So hopefully that helps in giving an overview of, A, where closed-loop automation fits into the overall big picture of managing and operating a network, and, B, some of the use cases where closed-loop automation can be applied to manage and operate your network. Now we're going to go into some specific details on a specific use case to see how exactly some of these concepts are applied. With that, I will hand it off to Praveen, who will get into those details. Thank you.

Thanks, Matthew. Hello everyone, I'm Praveen Jayachandran from IBM Research, and I will be talking about how we can apply AI-based analytics for service assurance and closed-loop orchestration of network functions and applications in the telco domain. So bear with me as I walk you through this specific chart. We think of service assurance in four phases. In the center you'll see this blue circle depicting the different phases. The first, the collect phase, is about collecting multiple signals, multiple data sources, from across the stack, and that's key to being able to perform any sort of analytics. The second stage is the detect stage, where the objective is to first detect whether there is a potential problem, whether it's a fault or a performance bottleneck, whatever the nature of the fault is. We want to first be able to detect that there might be something wrong. The next phase is the decide phase, where we want to be able to determine what the nature of the fault is and pinpoint the root cause. And finally, once we understand what the root cause of the problem is, we determine the right action to take that can suitably remedy that performance bottleneck or fault, as the case may be. So let's go through a bit of each of these phases.
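As an aside before those phase-by-phase details: much of the detect phase comes down to time-series anomaly detection on individual KPIs. The sketch below uses a simple rolling z-score as a stand-in for the trend-analysis and dynamic-threshold methods described next; the window size, threshold, and example series are arbitrary illustrative values.

import numpy as np


def detect_anomalies(series, window=60, z_threshold=3.0):
    """Flag points that deviate strongly from a rolling baseline.

    series: one KPI sampled at a fixed interval (for example, link
    utilization or UE registration time). Returns anomalous indices.
    """
    series = np.asarray(series, dtype=float)
    anomalies = []
    for t in range(window, len(series)):
        baseline = series[t - window:t]
        mean, std = baseline.mean(), baseline.std()
        if std == 0:
            continue  # flat baseline; skip to avoid division by zero
        if abs(series[t] - mean) / std > z_threshold:
            anomalies.append(t)
    return anomalies


# Example: a mostly steady utilization series with a late spike.
utilization = np.concatenate([np.random.normal(0.4, 0.02, 200), [0.90, 0.92, 0.95]])
print(detect_anomalies(utilization))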
So in the collect phase, as I mentioned, we're going to be collecting different data sources. We are interested in collecting metrics, logs, and events, which could be tickets generated from different appliances, and also topology information about how different components are connected with each other. We want to be able to collect data from across the stack, right from the hardware infrastructure layer to the platform layer. For instance, this could be Kubernetes or OpenShift running containers, or this could be virtual machines on OpenStack. We want to be able to collect metrics across the stack, all the way up to the application layer. That collection gives us a rich source of information on which we can perform analytics. In the second phase, the detect phase, we want to be able to analyze each of these data sources, reduce the dimensionality of the data, and detect what the problem might be. Starting with metrics, we perform trend analysis, anomaly detection, and figuring out dynamic thresholds on these metrics, reducing that high-dimensionality metric data into a smaller set of anomalies that are easier to handle. These anomalies would be on specific time series of the metric data, and we collect those anomalies. Likewise, on the logs, we perform entity extraction and template extraction based on NLP methods, and then we extract anomalies on those log templates. So these are anomalies from both metrics and logs. And likewise, we might be getting alarms or events from different components. If we have anomalies present across these different data sources, that would be indicative of a potential fault. That's what we do in the detect phase. The third phase, the decide phase, is where we look to combine these different data sources together. The metric anomalies and the log anomalies can actually be combined as a single categorical time series, and that could also be combined with events. We then perform event grouping, clustering, and suppression algorithms on this categorical time series, and then we can do things like temporal analysis, seasonal analysis, and such. And finally, a weighted probable cause analysis: amongst all of these anomalies, which are indicative of the root cause, and which others may be an effect of the root cause. Being able to identify a probable cause is an important step in the decide phase. Along with that, we could also combine this with topology information. Once we have a sense of the probable cause, we could combine that with topology to actually localize the fault to a specific component. That's where we do fault localization and dependency mapping between different components. We also look at algorithms for historical topology analysis: has the topology changed significantly from what it used to be, and so on. All of these algorithms can be applied at the detect and decide phases to actually come down to a root cause analysis of where the fault might be. Once we have identified the root cause, the last step is to be able to automatically remedy the problem. Here we have policy-based actions that would be captured as runbooks, as shown on the right. These runbooks are simply an event-condition-action kind of framework: if you have a particular root cause event that you've identified, then you could have a policy that says that if you see this event, then take this particular action. And the action could invoke an orchestrator to actually remedy the problem. So the action could be a healing action, or a scaling action to provide more resources. In other cases, these actions may actually be written down in textual language that a systems reliability engineer is expected to follow manually. There we can apply robotic process automation to automate those manually written steps, again leveraging NLP techniques to do that.
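A minimal sketch of such an event-condition-action policy, with made-up event types, conditions, and actions standing in for real runbook content:

# Hypothetical runbook table: each entry maps a root-cause event type to the
# condition under which it applies and the remediation action to request.
RUNBOOKS = {
    "amf_performance_bottleneck": {
        "condition": lambda event: event.get("severity", 0) >= 3,
        "action": {"operation": "scale_out", "target": "amf"},
    },
    "upf_pod_crashloop": {
        "condition": lambda event: True,
        "action": {"operation": "restart", "target": "upf"},
    },
}


def handle_root_cause_event(event, orchestrator):
    """Event-condition-action dispatch: look up the runbook for the root-cause
    event and, if its condition holds, hand the action to the orchestrator."""
    runbook = RUNBOOKS.get(event["type"])
    if runbook is None or not runbook["condition"](event):
        return None  # no applicable policy; leave it to a human operator
    return orchestrator(event["slice_id"], runbook["action"])


# Illustrative use with a stand-in orchestrator call.
def fake_orchestrator(slice_id, action):
    print(f"Requesting {action['operation']} of {action['target']} on {slice_id}")
    return "accepted"


handle_root_cause_event(
    {"type": "amf_performance_bottleneck", "severity": 4, "slice_id": "slice-red"},
    fake_orchestrator,
)

The point of the sketch is only the event-condition-action shape; in practice the runbooks themselves would be authored and maintained by the operations teams.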
Now, what we'll do next is actually see this in action in a demo for slice assurance of our 5G core. At a high level, the demo flow is this: we start with an orchestrator deploying our 5G core slice subnet on OpenShift. We drive workload against it. We are continuously monitoring metrics from that 5G core slice subnet. The centerpiece is the NWDAF, and I'll get to that in a bit; that is where we do the analytics to actually determine what the problem is. And then finally, we take a remediation action to actually fix the performance bottleneck that we had before. So now let's take a closer look at the central analytics piece. As per the 3GPP specifications, the NWDAF, the network data analytics function, and the MDAF, the management data analytics function, are intended to be the single point of consumption of all data and analytics pertaining to a particular slice subnet. It has a subscription interface, a publish-subscribe interface, so anyone can actually subscribe to data and analytics pertaining to a particular slice. In this case, we show that our orchestrator is subscribing to the data and analytics. Internally, the MDAF, the management data analytics function, is going to analyze different sources of data across the stack, specifically thousands of metrics that we are collecting across the stack, and perform anomaly detection and metric fingerprinting. Anomaly detection looks at each metric time series and figures out whether there is an anomaly; once we have a set of anomalies across a set of metrics, we then do fingerprinting to say, okay, this set of metric anomalies is indicative of this particular failure type. The hypothesis here is that different failure types will result in different anomalies across different metrics. So if there is a performance bottleneck, versus some other kind of fault, then the metric anomalies will actually be different for the different kinds of faults, and that will show up in the fingerprinting being able to label it as a particular fault type. Now, once we have identified a particular fault type, we then send out a notification to a particular URL, and that URL is specified as part of the subscription itself. The orchestrator is subscribed to the analytics for this particular slice, specifying a URL to which this notification should be sent, and we send it to the specified URL. In this case, we have IBM's Netcool product as the event manager, where these events are being sent. In the event manager, we have a runbook automation that will automatically trigger a remediation action, which is invoking the orchestrator to take a scaling action on the slice, where we provide more resources for this particular 5G core slice subnet to actually remedy the performance bottleneck. So let me just quickly walk you through the demo flow. We start with the orchestrator, which is IBM's Cloud Pak for Network Automation. We deploy the 5G core slice subnet.
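As a rough illustration of the subscription step that comes next, the sketch below shows a simplified analytics subscription carrying slice identifiers and a notification URI. The endpoint path, payload fields, and response field are illustrative only; they are not the exact 3GPP Nnwdaf_EventsSubscription schema or any IBM API.

import requests

# Illustrative endpoint: a simplified stand-in for an NWDAF/MDAF
# analytics exposure point, not a real product or 3GPP URL.
NWDAF_URL = "http://nwdaf.example.internal:8080"

subscription = {
    # Slice identifiers for the slice subnet we care about (illustrative values).
    "snssai": {"sst": 1, "sd": "000001"},
    "eventId": "SLICE_LOAD_ANALYTICS",
    # Where the analytics function should POST notifications, e.g. the
    # event manager or the orchestrator's callback endpoint.
    "notificationUri": "http://event-manager.example.internal/notify/slice-red",
}

resp = requests.post(f"{NWDAF_URL}/subscriptions", json=subscription, timeout=10)
resp.raise_for_status()
print("Subscription created:", resp.json().get("subscriptionId"))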
We also subscribe to analytics for the slice subnet through the NWDAF and the MDAF. We are continuously monitoring the data collected while a workload generator drives traffic, and we are collecting the metrics across the stack, so this is infrastructure-layer metrics as well as 5G core-layer metrics. We then perform anomaly detection and failure fingerprinting. And once we identify that there's a specific problem in what we are monitoring, we send out a notification pertaining to that particular slice subnet to an event manager. From the event manager, we automatically trigger a remediation action to be taken by the orchestrator to scale up the slice and provide it more resources. So that's the demo that we are going to show. Before we actually get to the demo, just a few more details. As I mentioned before, we are collecting metrics across the stack using a federated Prometheus architecture. We are collecting metrics from the OpenShift layer and from each 5G core slice subnet that is deployed. In the demo, we'll actually show two 5G core slice subnets deployed. We are collecting about 1,900 metrics in all, each of them collected once every minute. We then apply different anomaly detection algorithms to detect whether there is an anomaly on each of these time series; we have at least three different algorithms that we're applying for anomaly detection. Each of these anomalies can again be thought of as a time series. If you look at this picture, each of these KPIs, think of them as the metrics, and you have a metric time series. This is the raw data, and it is very high-dimensionality data, because this is 1,900 metrics collected continuously. Then we look for anomalies. At each time instant we see a certain set of anomalies that are triggered, and at each time instant it could be a different set of these anomalies. We then analyze these anomalies to identify what the specific fault type is. We do a mapping from the anomalies to the fault type using algorithms such as TF-IDF, classifiers such as a random forest, or discriminative sequence mining when the actual sequence of alarms is important (a small illustrative sketch of this mapping is shown below). The labels for exactly what the type of fault is are provided by an expert. A human in the loop will come in and provide labels that say, okay, this particular set of KPI anomalies is indicative of a performance bottleneck in the AMF of your 5G core. That sort of labeling is performed by the human, but otherwise the entire analysis and remediation happens automatically, in a closed-loop fashion. Now let's shift to the demo itself, the video of the demo. We start with Cloud Pak for Network Automation, which is the orchestrator; we start with that dashboard. The 5G core slice has been deployed. Now we are going to do a certain set of steps. We're going to first subscribe to the NWDAF and MDAF, saying that we are interested in analytics for this particular slice subnet. We're also going to start workload generation right after that. In the subscription, we have called out the specific slice identifiers and the notification URI to which any notification is sent. Here we see the details of the subscription, including the NSSAI, the slice identifiers, and the notification URI. Then we've started workload generation.
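Returning to the anomaly-to-fault-type mapping mentioned a moment ago, here is a minimal sketch of that fingerprinting idea using TF-IDF features and a random forest classifier from scikit-learn. The metric names, fault labels, and tiny training set are invented for illustration; in the real system the labels come from a human expert and the feature space is far larger.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Each training example is one time window, summarized as the names of the
# metrics that were anomalous in that window. Labels come from a human expert.
windows = [
    "amf_cpu_high amf_registration_latency_high smf_session_setup_latency_high",
    "amf_cpu_high amf_registration_latency_high",
    "upf_packet_drop_high node_network_errors_high",
    "node_network_errors_high upf_packet_drop_high upf_cpu_high",
    "no_anomaly",
    "no_anomaly",
]
labels = [
    "amf_performance_bottleneck",
    "amf_performance_bottleneck",
    "transport_degradation",
    "transport_degradation",
    "healthy",
    "healthy",
]

# TF-IDF turns each window's set of anomalous metrics into a weighted feature
# vector; the random forest learns which combinations indicate which fault.
fingerprinter = make_pipeline(TfidfVectorizer(), RandomForestClassifier(n_estimators=100))
fingerprinter.fit(windows, labels)

new_window = "amf_registration_latency_high amf_cpu_high"
print(fingerprinter.predict([new_window])[0])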
Here we are actually generating workload across two different slice subnets, just for reference. On one of them, the workload is very high, much higher than what is intended for the resources provided. From both slices we are collecting a large set of metrics, so you'll see those different metrics here, and we are also collecting metrics from the infrastructure layer that we can correlate against. In this particular case, let's look at the UE average session registration time. Here you will see that there are two plots: one for the green slice subnet and the other for the red one. As you can see, the red one is taking significantly longer for the UE session registration, and that's because the workload on this slice subnet is much higher than anticipated. So that's where the performance bottleneck is. Now, the analysis is happening in the background. We first use IBM's Metric Manager product to identify anomalies across these metric time series. The fingerprinting has now happened, and we have pinpointed that there is a particular performance anomaly in the AMF for this red slice subnet. The event from the MDAS, the management data analytics service, is sent to the event manager, which is IBM's Netcool product. In the event manager, we have a policy specified where we can inspect the specific event and take a particular scaling action, a 5G core slice scaling action, which invokes the orchestrator to take that action. The orchestrator is now going to take that remediation action, which is going to scale up the slice subnet. Once the scale-up is done, we observe that the metric, the UE session registration time for the red slice, has come back to normal. Subsequently, once it comes back to normal, the analysis detects that there are no longer any anomalies. And once the anomalies stop, we actually send a second event out to the event manager to show that there is no longer a performance bottleneck; the severity is set to zero. This is indicative that the problem has successfully been solved. And if you go to the event manager, you'll actually see that this event is now green, showing that the remediation has been successful. Now let's come back to the presentation. In this demo, we primarily focused only on metrics. But when logs and topology information are also available, then what we can do is combine the metric anomalies with any anomalies obtained from logs. We can perform event grouping to actually group these different anomalies together, both based on time and based on entities extracted from these logs and metrics. The combined, grouped events can then be looked up against the topology for fault localization, to identify the specific component where the failure is occurring, and then subsequently take the most appropriate action. With that, I will hand it back to my colleague, Gopal, to take you through the last bit.

One of the most critical things is to help the clients in the journey that they're currently involved in from a closed-loop automation standpoint. Many clients have already started the journey, and many clients are in the journey and looking to see how they can mitigate some of the challenges as they progress through closed-loop automation. And my answer typically is through three T's, for any of the telcos globally. The three T's are talent, technology, and tools.
And if you have those three things, any telco, anywhere in the world, can take advantage of this. When you look at the first column, on the left-hand side, that's about talent. It's important that the carriers have the right set of skills and talent to get the job done. Whether it's AI experts, network engineers, or data scientists, you need a combination of all of those for this particular initiative. The next one is about technology: using the right combination of AI, automation, and machine learning, and applying that to data when you put a lot of these things on the far edge in a 5G ecosystem. The third is about tools and new ways of working: a combination of agile, DevOps, and design thinking to help do these initiatives and activities faster and better. And when you have those three T's, talent, technology, and tools, you can help any telco anywhere in the world. Thank you.