Hi, I'm Avnan Brnok, and today I'm going to show you how platform telemetry, enabled with operators, can be used to maximize service availability for latency-sensitive workloads in a Kubernetes environment. Thanks to the team who helped make this work happen.

Latency-sensitive Kubernetes applications depend on access to resilient hardware resources to allow fast, low-latency packet processing. Monitoring and exposing the health of the underlying platform is key to maintaining application performance, allowing the network operator to react to issues and maintain SLAs. And while this monitoring and reaction can be done manually, this demo showcases the necessary components for a zero-touch automated infrastructure, combining Kubernetes enhancements, the Operator Framework, and host telemetry.

Closed-loop automation is the process of identifying a set of metrics, measuring them, detecting anomalies, and correcting them via the necessary action. This process of closed-loop automation helps reduce operational expenditure. Within OPNFV, the closed-loop automation working group focuses on NFVI-based closed-loop use cases and helps catalog capabilities, identify gaps, and provide reference solutions.

Using components from the InfraWatch project, the collectd and Prometheus operators, host health telemetry is streamed from the platform and used to make informed scheduling decisions with the Telemetry Aware Scheduler in the Kubernetes control plane. By automatically reacting to issues in the network and intelligently orchestrating solutions, application downtime is minimized and, as a result, service availability is maximized.

For the components used in this demo, the Intel PMU and RDT collectd plugins, available in OPNFV Barometer, are used to monitor CPU cache counters and memory bandwidth usage. These provide an indicator of the health of the platform, and there's more info on this in the accompanying demo video.
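As a rough sketch, the two Barometer plugins mentioned here are typically enabled in the collectd configuration file along these lines; exact option names vary by plugin version, so treat this as illustrative rather than a working config:

```
# Monitor last-level cache occupancy and memory bandwidth via Intel RDT
LoadPlugin intel_rdt
<Plugin intel_rdt>
  Cores ""                          # empty string: monitor all cores
</Plugin>

# Read CPU performance counters via the Intel PMU
LoadPlugin intel_pmu
<Plugin intel_pmu>
  ReportHardwareCacheEvents true    # cache hit/miss counters
</Plugin>
```

With these plugins loaded, collectd exposes the cache and memory-bandwidth metrics that the rest of the pipeline consumes.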
Operators from the InfraWatch project are used to deploy and configure collectd and Prometheus, providing up-to-date metrics from each node in the cluster. The Kubernetes Telemetry Aware Scheduler is used to deploy pods based on platform telemetry collected and stored with the collectd and Prometheus operators. This forms the basis for this platform resiliency demo; more info on this in a few slides.

For this demo, two Intel Xeon 6230N platforms are used in this Kubernetes cluster. However, the platform telemetry is available across all Intel servers. For a list of available collectd plugins, please see the link to OPNFV Barometer provided.

In Kubernetes, an administrator can define various objects using manifests written in YAML. These manifests define objects such as Deployments for how to start a pod, made up of one or more containers; Services for exposing service ports; and an Ingress in Kubernetes, or Routes in OpenShift, for accessing the application from an external network. Instead of defining all the manifests manually and then having the overhead of managing the setup of different instances, an operator can manage these objects for you by defining a Custom Resource Definition, extending the Kubernetes API, and a custom resource object. Operators enable the automated management of these objects, resulting in better object lifecycle management, ease of use, and deployment.

The collectd operator defines a new API group called collectd.infra.watch. Once the collectd operator is loaded, an instance of collectd can then be started on all nodes by creating a Collectd object. When defining the Collectd object, the operator manages loading the collectd configuration with a list of plugins and their parameters. The Collectd object is then watched by the operator, which manages the creation of all dependent objects, resulting in collectd running on all nodes.
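To make this concrete, a Collectd custom resource submitted to the collectd operator might look something like the following. The field names in the spec are illustrative assumptions, not the operator's exact schema:

```yaml
# Hypothetical Collectd custom resource; spec fields are illustrative
apiVersion: collectd.infra.watch/v1alpha1
kind: Collectd
metadata:
  name: cluster-collectd
spec:
  plugins:
    - name: intel_rdt   # cache occupancy and memory bandwidth telemetry
    - name: intel_pmu   # CPU performance counters
```

Creating this one object is what triggers the operator to render the collectd configuration and start collectd on every node.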
The Operator Framework allows easier management and configuration of collectd and its plugins in a Kubernetes environment.

Within vanilla Kubernetes, there is no mechanism for using telemetry in scheduling decisions, and as a result there is no workload migration in advance of, or during, node health deterioration. This can lead to service disruption and application downtime. Telemetry Aware Scheduling, or TAS, is an extension to the Kubernetes scheduler that uses telemetry data to make pod scheduling decisions. Through a rule-based, user-defined TAS policy, scheduling decisions are made with up-to-date node metrics. By making the scheduler aware of node telemetry, pods can be orchestrated more intelligently, reducing service disruption and application downtime.

Taking a look at an example scenario, a workload will be scheduled based on node CPU usage. First, the cpu-used policy is created, specifying that pods should be scheduled on nodes with CPU usage of less than 50%. Then, a pod is specified with the cpu-used policy. When the pod is to be scheduled, the scheduler asks TAS for input. TAS reviews the policy and gets the associated metrics from the metrics API. All nodes are then ranked and prioritized by their CPU usage, and the scheduler makes a decision based on those priorities, scheduling the pod to node B.

To view a working demo of the closed-loop resiliency components working together, please view the accompanying video. This demo showcases how components from InfraWatch, combined with host telemetry, enable a zero-touch, automated, and resilient network infrastructure. collectd can also be deployed using the same tooling on an OpenStack platform to enable host telemetry within the Service Telemetry Framework. Combining these elements into a closed-loop solution provides a starting point for tomorrow's next-generation self-healing and self-optimizing networks.
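The example scenario above can be sketched as a TAS policy that the pod opts into. The layout below follows the upstream Telemetry Aware Scheduling examples, but the metric name and threshold are illustrative assumptions:

```yaml
# TAS policy sketch: avoid nodes above 50% CPU usage,
# and prefer the node with the lowest value of the metric
apiVersion: telemetry.intel.com/v1alpha1
kind: TASPolicy
metadata:
  name: cpu-used
  namespace: default
spec:
  strategies:
    dontschedule:
      rules:
        - metricname: cpu_used    # illustrative metric name
          operator: GreaterThan
          target: 50
    scheduleonmetric:
      rules:
        - metricname: cpu_used
          operator: LessThan
```

A pod then references the policy via a label such as `telemetry-policy: cpu-used`, which TAS inspects when the scheduler asks it for input.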