Good morning, good afternoon, or good evening. My name is Takuya Miyazaka. I am a network engineer at KDDI, which is a Japanese network operator. Today, we will talk about network anomaly event prediction and optimal resource control in cloud-native network functions, together with my co-presenter, Takaya Miyazawa. We had hoped to attend this conference on-site, but due to the COVID-19 restrictions it became very difficult for us to travel to Spain, so we are grateful to the event management team for arranging this remote presentation.

This is the introduction and the motivation for this talk. The use of cloud-native network functions (CNFs) in the telco industry, for example in 5G and MEC, has started to emerge. From the network operator's perspective, the benefits of CNFs include quick service deployment, flexibility of operation and management, and accommodation of diverse business models. While these benefits are attractive to us, as a network operator we need to fulfill the typical requirements of telco services, such as high availability, resiliency, and optimal resource utilization. To meet these requirements, there are some challenges we need to tackle. For example, to achieve extremely high availability, we need to predict network anomaly events and resolve them before our customers have a bad experience.

This slide summarizes the lessons shared in today's talk. In the first section, I will talk about network anomaly event pre-detection in the 5G network, using eBPF and deep learning technology. After my talk, Takaya Miyazawa will share his experience with optimal resource control in CNFs.

So, let me begin with the network anomaly event pre-detection part. Network anomaly events are inevitable for network operators and occur almost every day on our commercial networks. Typical examples of network anomaly events are an increase in network latency, an increase in the packet loss rate, and an increase in the CPU utilization rate. This network diagram shows the 5G core network on the
Kubernetes environment, and let's consider an example of a packet loss event in the control plane of the 5G core network. For some reason, the packet loss rate in the control plane of a 5G core network built on Kubernetes starts to increase at some moment. If we don't take any action, the packet loss rate will gradually increase, at some point it will have a significant impact on the communication between network functions in the control plane, and finally it will cause a major problem in customers' 5G services. To avoid such a situation, the monitoring system needs to detect this network anomaly event as soon as possible.

This is our motivation: to detect these network anomaly events rapidly and precisely. But is it possible? Our hypothesis for this project was that basic metrics such as the CPU utilization rate are not enough to achieve rapid and precise detection. This is because the amount of information in such basic metrics is very limited, and we considered it insufficient for future prediction by deep learning. So, our idea was eBPF observability.

Before jumping into the details, let me show you an overview of our approach. Our approach consists of two parts, and our target application is the 5G core containers, which are built on Kubernetes on top of a Linux kernel. Firstly, as I said on the previous slide, we used eBPF functionality to collect detailed per-container metrics. This diagram shows an example of an eBPF metric: the number of TCP retransmissions of a 5G core network function container such as the AMF. This eBPF metric was measured during an anomaly event in which the packet loss rate increases gradually. As you can see from this diagram, as the packet loss rate increases, the number of TCP retransmissions also increases due to the TCP packet drops. This observation also implies that it may be possible to predict network failures from these eBPF metrics.

Secondly, AI and ML. We used an AI/ML model on the eBPF
metrics to predict the future key indicators of 5G QoE for customers. This example diagram shows the prediction of the number of UE registration failures in the 5G network. If we can predict the key indicator precisely, we can detect an anomaly event that violates a threshold or an SLA. These two things are the overview of our approach, and from the next slide I'd like to talk about each part in detail.

So, eBPF. This table shows some of the eBPF metrics collected in our work. We collected some CPU-related metrics such as run queue latency. We also collected several TCP-related metrics such as TCP retransmissions. Next, these two graphs show the eBPF metric behavior when an anomaly event occurs. The top graph shows the behavior of run queue latency in the case of a CPU utilization rate increase event, and the bottom graph shows the behavior of TCP retransmissions in the case of a packet loss event. As you can see from these examples, several eBPF metrics are positively correlated with anomaly events. Therefore, we expect that these detailed metrics can be applied to deep learning for future prediction.

Next is the detail of the AI/ML part. We use the LSTM model, which is a recurrent-neural-network-based deep learning model, for future prediction of time-series data such as the number of UE registration failures. As the input data for the LSTM, we apply the basic metrics of each container collected by cAdvisor, together with the eBPF metrics. Using these past actual metrics as input, the LSTM can be used to predict future metrics.

This slide shows the evaluation result of the LSTM for a packet loss event in the control plane of the 5G network. In this evaluation, we simulated many anomaly events in our experimental 5G network and collected metrics of the 5GC network function containers for LSTM training. Our first evaluation result shows that eBPF enables more accurate prediction. With eBPF metrics and cAdvisor's basic metrics, accurate prediction was achieved at 150 seconds after the start of the anomaly event. On the other hand, without eBPF, at
the same 150-second time point, accurate prediction was not possible. The upper line shows the actual value and the lower dashed line shows the value predicted by the LSTM. As you can see, there is a large estimation error without eBPF. Although these results were obtained in our test environment and have not been evaluated in actual commercial networks so far, we expect that eBPF will enable rapid and accurate detection of anomaly events in 5G networks.

This is a summary of my part. Thanks to eBPF, detailed per-container metrics can be collected even in a CNF-based 5G network. And our first evaluation result shows that eBPF and AI/ML will enable faster detection of future network anomaly events in a CNF-based 5G network. With this, I finish my part on network anomaly event prediction in the 5G network. From the next slide, Miyazawa will talk about optimal resource control.

Next, I'm going to introduce our activity on optimal resource control in cloud-native network functions. My name is Miyazawa, from NICT, Japan.

This is the summary of this talk. NICT operates a public network testbed called JGN in Japan. There are several access points, but this time we have created several virtual machines and constructed a small-scale virtual network in the Hokuriku region, in the central-northern part of Japan. The virtual network consists of several nodes, each of which has several virtual machines. Monolithic network functions are deployed directly on a virtual machine as VNFs. Cloud-native network functions, often called microservices, are deployed as CNFs on containers created on a virtual machine or host OS. We have installed ETSI Open Source MANO (OSM) on JGN in the Hokuriku region, and it operates these network functions. We have also deployed our AI/machine-learning engine for analytics of data such as CPU utilization and for computational resource control, especially for CNFs, and connected it to the OSM system as a northbound component. The important thing in this talk is that we have implemented an additional interface, especially for
autoscaling. That interface is the main topic of this talk.

This is a framework of AI-assisted computational resource control and management for network functions. This framework is compliant with the ITU-T Y.3177 standard. AI and machine learning should be built into both data analytics and resource control decisions to provide high agility in finding a solution for dynamic resource adjustment. The data analytics system analyzes and predicts resource usage based on measurement results obtained from the underlying network and/or cloud infrastructure. That is usually completed in seconds or minutes by utilizing machine learning. Based on the analysis result, the resource control system decides on a solution to execute resource arbitration among services and/or network function migration, which is also completed in seconds or minutes by means of machine learning. Our goal is to simultaneously achieve maintenance and enhancement of quality of service, effective utilization of computational resources, and agile processing of data analytics and resource control at a time granularity of seconds or minutes.

This figure illustrates our verification environment, a model network infrastructure consisting of four network nodes, seven links, and two end hosts. We constructed one part of it on the NICT public network testbed JGN in the Hokuriku region, constructed the other parts in Koganei city in Tokyo, and interconnected both networks over around 400 kilometers. We have installed ETSI Open Source MANO version 10 into our network function operation system by means of a Kubernetes installation. As northbound components, there are the AI engine, the OSM client, and the visualization system. The AI engine obtains data from the underlying infrastructure, such as the CPU utilization of each network function, analyzes the data, and decides on the solution for adaptive resource reallocation to each network function.

So, what is lacking or insufficient in OSM version 10? Firstly, it cannot obtain some information that is necessary for adaptive resource allocation
to CNFs with higher accuracy and higher scalability. Concretely, OSM cannot get the current real-time fluctuation in CPU utilization every second, although it has an autoscaling function. Secondly, it has no function for designated scaling, for example designating another virtual machine to which a containerized network function is migrated from the current virtual machine. Besides, it is impossible to designate the amount of CPU resource to be added to or removed from each CNF. Thirdly, the manageability is relatively low, especially in terms of support for visualization. Fourthly, the operability is relatively low, especially in terms of manual operations; for example, calling a web API and receiving a session are required. Finally, there is no AI/ML function to analyze various data and reflect the results in network function operation.

So, to solve the issues that I showed you on the previous page, we have added two things. One is an additional interface for functional enhancement, called the direct interface. The other is our AI engine, which automates CPU data analytics and computational resource control for CNFs by utilizing AI and machine learning techniques. I'm not going to explain the AI engine in this talk, but briefly speaking, we adopt several machine learning algorithms such as support vector regression, LASSO, and an encoder-decoder recurrent neural network for time-series CPU data analytics and automatic resource control decisions. This talk focuses on the direct interface for functional enhancement.

This is an overview of the direct interface added to OSM. The northbound components, such as the AI engine and the visualization system, and the middle components, such as the VIM, Kubernetes, and the NFV infrastructure, are connected with each other by the direct interface via the Kafka bus, without going through the OSM internal components such as the lifecycle manager, resource orchestrator, and so on. The main objectives of this additional interface are to obtain necessary resource
information, resource arbitration among CNFs, and NF migration from one server to another. In other words, the additional interface may only be effective for these purposes, and the original OSM interfaces and internal components may be effective for all other objectives. So, it should be better to equip the system with both interfaces, simply by adding the direct interface to the current OSM.

So, what are the benefits of the direct interface? The first is obtaining VNF and CNF CPU utilization every second. We use cgroups to achieve this, tuned for each VIM, which means each Kubernetes cluster. We can obtain higher accuracy and scalability in observing the real-time CPU utilization of each network function. The second is designated scaling. In horizontal scaling, we can specify the virtual machine on which each CNF is scaled out. The current OSM with Kubernetes cannot do this without our direct interface. In vertical scaling, we can specify the amount of CPU allocated to network functions. Again, the current OSM with Kubernetes cannot do this without our direct interface. The third is higher manageability. It supports visualizing a lot of information obtained through multiple OSM northbound interfaces. The fourth is higher operability. We have a convenient command, DCTL, which enables easier measurement of processing time thanks to a function that designates each request as asynchronous or synchronous.

So, finally, the overall conclusion for the two talks. The use of CNFs in the telco industry has been receiving much attention in recent years. While CNFs enable quick service deployment, we are still investigating solutions to meet the typical requirements of telco services, such as high availability and optimal allocation of computational resources for CNFs. In this session, we have introduced our experience in utilizing CNFs as an infrastructure for telco services while satisfying these typical requirements. In the former part, we have introduced lessons learned on eBPF
observability for the 5G core network deployed on Kubernetes, and showed that the detailed metrics of each 5GC network function container measured by eBPF enable the detection of future network anomaly events in the 5GC with deep learning. Then we introduced our ongoing activities on an autonomous computational resource control system for CNFs, which is compliant with the ETSI OSM standard and deploys the network function operation system on the NICT network testbed called JGN.

Thank you very much.