In the accompanying video, each of the closed-loop resiliency components and their roles in this demo were discussed. To view that video, please follow the link provided. In this video, we will combine those components to realise a platform resiliency use case.

The last-level cache (LLC) load misses and memory bandwidth usage of each node are monitored using the Intel PMU and Intel RDT plugins enabled in the CollectD operator. These metrics are then stored in Prometheus and made available to TAS via the custom metrics API. For the demo scenario, a sample Nginx deployment is created and linked to TAS with the resiliency TAS policy; TAS will make scheduling decisions only for pods from this deployment. Memory bandwidth usage and LLC load misses are used to indicate node health, because deploying a latency-sensitive application on a node with high levels of LLC misses and memory bandwidth usage can result in significant jitter and latency increases.

The resiliency TAS policy defines three states of node health and the associated scheduling and descheduling rules (sketches of such a policy and the demo deployments follow this narration):

- The healthy state, where the node is healthy and pods can be scheduled on it.
- The warning state, where the node is unhealthy: no more pods will be scheduled, but existing pods remain on the node.
- The critical state, where node health is critical: existing pods will be rescheduled onto a healthy node and no more pods will be scheduled on this node.

Firstly, with both nodes in a healthy state, one Nginx pod is created and deployed to node two. StressNG is then deployed to node two to stress the CPU, increasing the LLC load misses and memory bandwidth usage enough to put the node into a warning state. When the deployment is then scaled to three pods, TAS recognizes the warning state of node two and schedules the two new pods on the healthy node one; the first pod remains on node two, as per the resiliency TAS policy. The StressNG load is then increased to trigger the critical state on node two. As per the resiliency TAS policy, pods on a node in a critical state are descheduled and redeployed on a healthy node; in this case, the first pod is descheduled and redeployed on the healthy node one.

And now to see the demo in action. On the left-hand side of the screen are the memory bandwidth usage and LLC load misses for node one and node two, with the warning threshold in yellow and the critical threshold in red. On the top right, the Nginx pods and the nodes they are deployed to are listed. And on the bottom right, the TAS logs are shown, detailing the scheduling decisions being made.

Firstly, one Nginx pod is deployed to node two, scheduled by TAS. StressNG is started on node two, increasing the LLC load misses and memory bandwidth usage to the warning state. Two more pods are then created and scheduled to node one, as node two is in a warning state; the first pod remains on node two. StressNG is increased to trigger a critical state on node two. TAS recognizes this critical state, deschedules the Nginx pod on node two, and a new pod is deployed to the healthy node one.

This demo showcases how components from InfraWatch, combined with host telemetry, enable a zero-touch, automated, and resilient network infrastructure. CollectD can also be deployed using the same tooling in OSP to enable host telemetry within the Service Telemetry Framework.
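To make the three states concrete, here is a minimal sketch of what the resiliency TAS policy could look like, using the TASPolicy custom resource format from Intel's Telemetry Aware Scheduling project. The policy name, metric names (memory_bandwidth, llc_load_misses) and threshold values are illustrative assumptions: the real metric names depend on how the collectd metrics are exposed through the custom metrics API, and the demo's actual thresholds are not shown in the video.

```yaml
apiVersion: telemetry.intel.com/v1alpha1
kind: TASPolicy
metadata:
  name: resiliency-policy          # illustrative name
  namespace: default
spec:
  strategies:
    scheduleonmetric:              # healthy state: prefer the node with the lower metric value
      rules:
      - metricname: llc_load_misses
        operator: LessThan
    dontschedule:                  # warning state: block new pods once a threshold is crossed
      rules:
      - metricname: memory_bandwidth
        operator: GreaterThan
        target: 60                 # assumed warning threshold
      - metricname: llc_load_misses
        operator: GreaterThan
        target: 60
    deschedule:                    # critical state: evict existing pods from the node
      rules:
      - metricname: memory_bandwidth
        operator: GreaterThan
        target: 90                 # assumed critical threshold
      - metricname: llc_load_misses
        operator: GreaterThan
        target: 90
```

In TAS, the deschedule strategy marks a violating node with a node label; the eviction itself is typically carried out by a separately deployed Kubernetes descheduler, after which the pods are rescheduled onto a healthy node.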
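The link between the Nginx deployment and the policy is a pod label. A minimal sketch, assuming the policy name above (the deployment name and image are likewise illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-nginx                 # illustrative name
spec:
  replicas: 1                      # scaled to 3 partway through the demo
  selector:
    matchLabels:
      app: demo-nginx
  template:
    metadata:
      labels:
        app: demo-nginx
        telemetry-policy: resiliency-policy   # ties these pods to the TASPolicy above
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
```

Scaling this deployment from one to three replicas (for example, kubectl scale deployment demo-nginx --replicas=3) is what triggers the scheduling decision shown in the demo: with node two in a warning state, both new pods land on node one.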
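The load generator can be sketched the same way. The container image, stressor arguments, and node name are assumptions; the pod is pinned to node two with a nodeSelector to mirror the demo, and raising the stressor counts (or the replica count) is what pushes node two from the warning state into the critical state.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stressng                   # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: stressng
  template:
    metadata:
      labels:
        app: stressng
    spec:
      nodeSelector:
        kubernetes.io/hostname: node2        # assumed node name; pins the stressor to node two
      containers:
      - name: stressng
        image: alexeiled/stress-ng           # assumed public stress-ng image
        args: ["--cpu", "4", "--cache", "4"] # CPU and cache-thrashing workers drive up LLC misses and memory bandwidth
```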
Combining these elements into a closed-loop solution provides a starting point for next-generation self-healing and self-optimizing networks.