Hi, I am Neil Oliver, and I'm speaking also for my colleague Kanan Babu Ramya. We are principal engineers in the Network Platforms Group at Intel. Our topic is green computing as it applies to edge computing platforms.

So-called green computing refers to the technologies and design choices that allow computing services to be provided to people while reducing their impact on the environment. Computing can be made greener in many ways. It can use less power by taking advantage of energy-efficient hardware and software. It can use ruggedized hardware that doesn't require as much secondary power usage, such as for cooling. It can use green power sources such as solar, wind, and hydroelectric generated electricity. It can source hardware elements that are less damaging to the environment. Cloud computing is already a voracious consumer of energy resources, and as edge networks come online, green computing will become very important in these domains.

We work on the Intel SmartEdge project, which is developing and deploying both open and commercial high-performance edge computing platforms. As part of this project we are developing building blocks for green edge computing, and we would like to describe our work. We will describe the use cases of principal interest and note the different use cases that appear in the cloud and the edge. Since our building blocks presuppose CPUs that provide sophisticated power management, we will review how these work and how we can take advantage of them. We then move up to the software level and talk about how edge use cases lead us to focus on telemetry and scheduling for applications. Finally, since our project is work in progress, we will discuss next steps and invite interested partners to collaborate with us in deploying green edge networks.

Up to a point, cloud and edge computing have common use cases and can use common solutions. We will talk about cloud use cases first, as it will make it easier to understand edge use cases. Clouds derive their value from economies of scale, so incremental reductions in power usage yield large cost benefits. These can accrue to cloud operators, to application providers, and to end users. Because of their large and increasing energy demands, clouds are attracting the attention of regulators. Sometimes by mandate and sometimes out of public-spiritedness, large enterprises and clouds are increasingly finding it in their interest to reduce those energy demands.

Some use cases, such as selection of renewable energy sources, have consequences for architecture and design. Renewable energy sources may also be volatile sources. For example: no wind, no wind power. A cloud that can select its energy sources may need to use them judiciously to maintain continuity of service.

The easiest way for a cloud to reduce power usage is to provision energy-efficient infrastructure. This obviously includes servers, but it also includes accelerators of various kinds, such as for media processing or AI inferencing. Because cloud data centers are so large in scale, special-purpose hardware need not be a scarce resource and need not have much impact on scheduling the workloads. Another reasonable approach is to optimize workloads for power usage. For many cloud applications this may not be a good choice, because time to deployment may be more important than extreme optimization. However, many network functions, such as RAN and core, are both heavily used and power intensive, and optimizing them is advantageous.
In a cloud data center, workload scheduling can be done on the basis of CPU utilization, as networks inside a data center incur essentially no latency. Scaling workloads up and down on servers with different power management configurations is often good enough.

When we look at edge computing use cases, we can see that some of the same issues apply, and so some solutions, such as energy-efficient hardware selection and app optimization, are also appropriate. However, edge platforms have constraints that do not apply to cloud data centers, and these drive different use cases. Edge platforms are more constrained in size than clouds, and scaling beyond an edge platform typically incurs a latency cost. After all, the point of an edge network is to move compute resources so as to reduce network latency. Since an edge platform can't achieve unlimited compute capacity, it has to get by with high compute density and may have to have special configurations for critical applications. So the scale-up, scale-down scheduling methods that work well in a cloud will be harder to accomplish in an edge platform. Also, if an edge platform is in a remote location, some power management constraints may be hard constraints. For example, there may be an absolute cap on the available power, or the power may be supplied by a battery whose charge must be managed.

To deal with these use cases, app optimization is still helpful, and scheduling among platforms is still useful but more difficult. Making applications energy aware now looks like a more interesting solution for edge platforms. For example, an application could be notified that it needs to be moved to a lower-capacity core in order to save battery power. If it was developed for an edge environment, it might be able to accommodate a relocation by operating in a reduced-function mode.

Now that we have looked at edge use cases and possible solutions, we will start to look at the resources at our disposal. We will first look at how power management operates in modern CPUs. In general, we will see that there are tradeoffs between power consumption and compute performance, so this is often referred to as power performance management.

Hardware-based power management is fundamental to green edge computing. The ability to reduce the frequency of a core, and the ability to put a core into a deep sleep state where little or no power is consumed, are among the most significant actions that can be taken to reduce the power consumption of a platform. Doing this well is difficult. Every additional degree of freedom in changing the frequency or power state of a CPU or of its cores leads to more states that an edge platform can be in. And because there are more states, there are also more state transitions, and some of them will cause bugs in programs if they are not managed well.

From a high-level point of view, the basic actions that can be taken are to modify the power state of a device, where the device can be a CPU, a core, a memory or storage device, or a network interface, or to change the clock frequency of a CPU or a core. Some more sophisticated actions are those where an energy-based or temperature-based pattern-action rule can be established in a CPU. This is like doing power management from software, but with much better response times because everything is embedded in the hardware.
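To make these basic actions concrete: on a Linux-based edge node, the core frequency and idle-state controls are exposed through the sysfs filesystem. The following is a minimal sketch, assuming the standard Linux cpufreq and cpuidle interfaces are present and root privileges are available; the core numbers and frequency values are illustrative only, and a real platform would drive these knobs through a power agent rather than directly.

```python
# Minimal sketch: adjusting per-core frequency limits and idle (C) states
# through the Linux sysfs interface. Assumes the standard cpufreq and
# cpuidle drivers are loaded; must run with root privileges.
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")

def set_max_freq_khz(core: int, khz: int) -> None:
    """Cap the maximum frequency of one core (a P-state-level action)."""
    (CPU_ROOT / f"cpu{core}/cpufreq/scaling_max_freq").write_text(str(khz))

def disable_deep_idle(core: int) -> None:
    """Keep a latency-sensitive core out of deep C states by disabling
    the deeper idle states (states are indexed from 0, shallowest first)."""
    for state in sorted((CPU_ROOT / f"cpu{core}/cpuidle").glob("state[2-9]")):
        (state / "disable").write_text("1")

if __name__ == "__main__":
    set_max_freq_khz(2, 1_800_000)   # illustrative: throttle core 2 to 1.8 GHz
    disable_deep_idle(3)             # illustrative: core 3 runs a packet-processing pod
```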
Let's look at some of the options. This slide depicts a variety of hardware modes that modify the power consumption of a processor. Of course, power and performance represent a trade-off.

First, let's consider the C states. These are occasionally called sleep states because they halt various levels of processor activity. C0 represents a fully on state in terms of clock, caches, voltage, and power consumption. The C states above 0 halt the processor in the sense of turning off the clock, but other parts of the device, such as caches, may still be on, and the core still has a certain idle power level. For state C1, the idle power may be nearly as high as the power in the active state. States that flush the caches reduce idle power to a much lower level. The cost of low idle power, though, is a long wake-up time when it's time to turn the core back on. The amount of wake-up time that can be tolerated is a function of the program being run on the core. A packet-processing core, for example, may not be able to tolerate a high C state because of the risk of dropping packets. C states can apply to an entire package, and analogous states apply to devices as well, the so-called D states.

P states, or performance states, control the operating frequency and voltage of a core. In principle, this is a trade-off of power for execution time. For real-time or time-sensitive software, it may not be possible to use the P state to reduce the power of a core. Further, there are P state dependencies in multi-core CPUs that constrain the ability to independently change the P state of an individual core. This complicates scheduling algorithms.

Beyond C states and P states, certain additional technologies are available. Turbo Boost provides a means of trading off frequency against temperature. A core that gets too hot is throttled back by the hardware to prevent damage, but this feature provides for dynamic behavior to obtain higher performance. It is not of itself an obvious means of power reduction, as it was designed to optimize the processing speed of the core against the heat.

Speed Select Technology is a complex feature with multiple modes; two of its most interesting modes for green computing are performance profiles and class-of-service configurations. Performance profiles allow predetermined profiles to be defined, specifying which cores are active along with other parameters, and then one profile to be selected as active. A typical use of this is to trade off the number of active cores against the maximum core frequencies available on the cores. It is typically used to give a boost to critical workloads by giving them a fast core to run on, but the concept of profiles gives a scheduler a tool to manage the power performance of the CPU. Class-of-service, or CLOS, configurations specify maximum and minimum core frequencies; the configurations can be prioritized and cores tied to them, so that when throttling is required, the cores that are impacted can be chosen by the users of the system.

Running Average Power Limit, or RAPL, is a processor feature added specifically to manage power consumption at the hardware level. Power limits and averaging windows can be adjusted, RAPL power and thermal data is reported as telemetry, and values of this telemetry can be used as triggers for scheduling decisions.
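As an illustration of the telemetry side of RAPL, the sketch below derives average package power from the cumulative energy counter that the Linux powercap interface exposes when the intel-rapl driver is loaded. This is a minimal read-only example; the domain path can differ by platform, and the counter wraps, which the sketch corrects for.

```python
# Minimal sketch: deriving average package power from the RAPL energy
# counter exposed by the Linux powercap interface (intel-rapl driver).
import time
from pathlib import Path

RAPL_PKG = Path("/sys/class/powercap/intel-rapl:0")  # package 0 domain

def read_energy_uj() -> int:
    """Read the cumulative energy counter in microjoules."""
    return int((RAPL_PKG / "energy_uj").read_text())

def average_power_watts(interval_s: float = 1.0) -> float:
    """Sample the cumulative energy counter twice and convert to watts,
    correcting for wraparound of the counter at max_energy_range_uj."""
    max_range = int((RAPL_PKG / "max_energy_range_uj").read_text())
    e0 = read_energy_uj()
    time.sleep(interval_s)
    e1 = read_energy_uj()
    delta = (e1 - e0) % max_range  # handle counter wraparound
    return delta / 1e6 / interval_s

if __name__ == "__main__":
    print(f"package power: {average_power_watts():.1f} W")
```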
At this point we understand some of the tools at our disposal to manage the power consumption of the CPU, and we understand the performance implications. To deliver green computing to end users, we need to integrate it into an edge platform. For us, that means the Intel SmartEdge Open platform.

Intel SmartEdge Open is a software toolkit for building high-performance edge platforms. It exists as part of a family of edge computing solutions under the general name Intel SmartEdge, which also includes commercial solutions. SmartEdge Open has a Kubernetes-based infrastructure along with additional agents, resources, and other components. The infrastructure along with the additional components can be composed into a fully functional edge computing platform. The range of components includes various virtual networking components, virtualized radio access network and core implementations, databases and image stores, and reference implementations of applications. The project ships a variety of experience kits, which are validated platforms designed for the most important use cases that our customers have brought to us. As a toolkit, however, a partner or customer with an existing Kubernetes-based platform can integrate components into that platform. This talk is not the right place for a deep dive into SmartEdge Open, but a brief walkthrough is necessary in order to explain how it supports green computing.

In the figure, look at the box on the right labeled SmartEdge Edge Cluster Multi-Host. This is an edge platform, or cluster, with a separate control plane on the top and an edge node on the bottom. In both the control plane and the edge node, the infrastructure components are shown in gray. Above the hardware is the host operating system; we currently support Red Hat and Ubuntu. Above that is the containerization layer, made up of Docker and Kubernetes. Above the infrastructure is a collection of SmartEdge pods containing a wide variety of components used to operate the system. Together they provide the ability to set up container networks, to manage the lifecycle of applications, which includes scheduling, descheduling, and relocation, and to observe the system, which includes telemetry. Platform attestation provides the ability to make the platform a trusted platform. The SmartEdge components are too numerous to even list here, and different experience kits contain different combinations of components, so take the depiction on this slide as just an example. The box labeled SmartEdge All-in-One Single Node Cluster depicts the same functionality that we just covered, but in a different packaging, allowing SmartEdge configurations to fit different hardware footprints.

Customer apps can cover a wide variety of functionality. Generally speaking, if they can run Linux images in a Docker container, they can run in a SmartEdge platform. SmartEdge uses Helm to deploy applications as well as the SmartEdge platform itself. A growing number of applications, both developed by Intel and by third parties, are validated to run on a SmartEdge platform and can be obtained through the Intel Edge Software Hub.
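As a small illustration of the deployment model, installing a validated application onto the cluster comes down to a Helm install. The sketch below is a thin Python wrapper around the helm CLI; the release, chart, and namespace names are hypothetical placeholders, not actual SmartEdge artifacts.

```python
# Minimal sketch: deploying an application chart onto a SmartEdge
# (Kubernetes) cluster with Helm. Release, chart path, and namespace
# below are hypothetical placeholders.
import subprocess

def helm_install(release: str, chart: str, namespace: str) -> None:
    """Install a Helm chart, creating the namespace if needed."""
    subprocess.run(
        ["helm", "install", release, chart,
         "--namespace", namespace, "--create-namespace"],
        check=True,  # raise if helm exits nonzero
    )

if __name__ == "__main__":
    helm_install("media-analytics", "./charts/media-analytics", "apps")
```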
At this point we understand that there is a control plane, an edge node, and SmartEdge components that interact with each other. There are two major differences between this slide and the previous one. First, there are several SmartEdge components that were not shown previously. These components exist to collect telemetry, including power telemetry, and to act on external commands to modify the scheduling of applications and the power management configurations of the hardware. Second, there is an external system, the orchestrator, that is responsible for decisions concerning the scheduling and execution of applications over all of the edge platforms under its control.

An orchestrator may provide a user interface for use by a human system administrator. This is a common mode of operation, in fact. The orchestrator receives telemetry information, which is visualized for the system administrator. The administrator makes decisions, and the orchestrator directs the edge platform to carry them out. The orchestrator may also be able to operate without a human in the loop, or at least not directly in the loop. The orchestrator can provide a method for defining policies for managing applications and platforms. The policies are created by a system administrator but executed automatically. When a policy is triggered, an action is sent to the scheduler, which instructs the edge platform to take action. The orchestrator depicted here, however, has an analytics component that processes telemetry according to a trained model and makes orchestration decisions for the scheduler, which in turn instructs the edge platform to take action. An orchestrator may provide all three modes of operation: human, policy, and model based. They are all, however, dependent on telemetry received from the edge platform about the applications. This is the critical part. A modern server can provide power and temperature readings from all of its cores, peripheral devices, fans, chassis, and so on. But the orchestrator needs to take action based on the behavior of the applications.

Now let's return to the edge platform and the new components. Telemetry is collected and aggregated by the telemetry agent. This agent in general can collect metrics from any hardware or software component that provides or simulates a sensor. In SmartEdge Open, the telemetry agent currently uses an open source package called CollectD, but other telemetry collection packages exist and could possibly be used. The telemetry agent correlates power metrics collected from the hardware, the cores and other devices, to the hardware on which pods are running. Making the reasonable assumption that an application pod corresponds to an application, this allows power consumption to be attributed to the execution of the application. This is what we refer to as application-aware telemetry. It is supplied to the monitoring system Prometheus. Prometheus is an open source system monitoring and alerting package used to receive telemetry from the telemetry agent in a uniform, standardized way. Other components, such as the orchestrator and the power management controller, can receive alerts and query it for data.

The power management controller is an executive function that has the ability to invoke power management changes on the infrastructure. It is designed to be a Kubernetes component. It implements custom resource definitions, or CRDs, so that other parts of a Kubernetes system can invoke APIs on it. This means that when the orchestrator, whether human, policy trigger, or AI, decides that an edge platform needs to change the power mode of a core, it invokes an API on the power management controller in the standard Kubernetes manner, and the controller takes the requested action. The power agent is the counterpart of the power management controller. It is analogous to a device driver and translates power management commands from the controller into actions on the hardware infrastructure.
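To make the telemetry path concrete, the sketch below shows how an orchestrator-side component could pull application-aware power data from Prometheus over its standard HTTP query API. The metric name pod_power_watts and the Prometheus URL are hypothetical placeholders for whatever the telemetry agent actually exports.

```python
# Minimal sketch: querying Prometheus for per-pod power telemetry via
# its standard HTTP API. The metric name "pod_power_watts" and the
# Prometheus URL are hypothetical placeholders.
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://prometheus.example:9090"  # placeholder endpoint

def query(promql: str) -> list[dict]:
    """Run an instant PromQL query and return the result series."""
    url = f"{PROMETHEUS}/api/v1/query?" + urllib.parse.urlencode(
        {"query": promql})
    with urllib.request.urlopen(url) as resp:
        body = json.load(resp)
    return body["data"]["result"]

if __name__ == "__main__":
    # Average power attributed to each pod over the last five minutes.
    for series in query("avg_over_time(pod_power_watts[5m])"):
        pod = series["metric"].get("pod", "unknown")
        watts = float(series["value"][1])
        print(f"{pod}: {watts:.1f} W")
```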
This was a really long explanation, so let's summarize. The orchestrator makes decisions about the functioning of the edge platform and the applications. These actions can be to start or stop applications or to change the power management settings of the hardware. These actions are scheduling actions and are communicated by the scheduler. The scheduler has familiar Kubernetes resources that it uses to do this: the Kubernetes scheduler in the platform and the power management controller. The orchestrator provides an automation loop, which receives telemetry from the platform, analyzes it in various ways, and responds by telling the scheduler to take action. The components in the edge platform provide power telemetry in a standardized fashion and provide agents that control the power management of the hardware.

In the previous slide, when we described the telemetry agent, we indicated that it uses the CollectD open source package for metrics collection. CollectD is a very flexible package which uses a plugin architecture to collect different types of metrics. It currently implements dozens of plugins for many different applications. The metrics of interest for power management are highlighted in the slide. More information on them can be found at the URL at the bottom.

Earlier we noted that the orchestrator has a scheduler that could take actions affecting application scheduling and hardware power management. We will look at a couple of quick examples to show how this works. In this example we look at pod scheduling. The automation in this loop is policy based: it has a policy rule that forces rescheduling of an application pod. The actual mechanics of this are carried out by Kubernetes. Kubernetes has the ability to deschedule and reschedule pods as part of its normal operation. By composing appropriate deschedule and reschedule operations, a pod can be relocated to a different edge platform. A minimal sketch of such a pattern-action rule appears after this slide's discussion.

Here is another example. It depicts an automation loop that uses a model to forecast utilization, in terms of whether the number of applications will increase or decrease, and decides to expand or consolidate applications based on that information. Movement of the actual application pod from a currently active node operates as shown in the previous slide. However, relocation of the pod may require that a node be activated and moved into an active C state, or possibly, in the other direction, after descheduling, a core may be put into a sleep state. The scheduler sequences these actions appropriately and interacts with the descheduler and rescheduler and with the power management controller to take the actions.

In discussing the orchestrator, we pointed out that orchestration decisions could be made in various modes, including a human-in-the-loop mode. For that to happen, the human needs to be able to visualize the telemetry data. This is an example of a telemetry dashboard depicting power telemetry. It is implemented using the Grafana open source package, which supports a variety of widgets for time series visualization and the ability to customize dashboards. In this dashboard we see two widgets that display the power consumption data for the pods that are active in the edge platform, which are obviously correlated with applications, and the power consumed by each socket, which is strictly a hardware measure. For a system administrator, a dashboard visualization can be used to orchestrate a system. For a policy-based orchestrator, data sets collected in this manner are analyzed manually, resulting in pattern-action policy rules to be executed automatically by the orchestrator. For a model-based orchestrator, data sets collected in this manner are inputs for offline model training or for reinforcement learning models.
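Here is the minimal sketch of a pattern-action policy rule mentioned above: if a pod's attributed power exceeds a budget, the pod is deleted so that its controller recreates it and Kubernetes selects a new landing zone. The power-reading hook and the power budget are hypothetical, and a production orchestrator would act through the descheduler and the power management controller rather than raw kubectl.

```python
# Minimal sketch of a pattern-action policy rule: deschedule a pod whose
# attributed power exceeds a budget. The power-reading hook and the
# budget value are hypothetical placeholders.
import subprocess
import time

POWER_BUDGET_WATTS = 25.0  # hypothetical per-pod power budget

def pod_power_watts(pod: str, namespace: str) -> float:
    """Placeholder hook: in practice, a Prometheus query such as the one
    sketched earlier would supply this value."""
    raise NotImplementedError

def deschedule(pod: str, namespace: str) -> None:
    # Deleting the pod causes its controller (e.g. a Deployment) to
    # recreate it, letting the scheduler pick a new landing zone.
    subprocess.run(["kubectl", "delete", "pod", pod, "-n", namespace],
                   check=True)

def policy_loop(pod: str, namespace: str, period_s: float = 30.0) -> None:
    # Evaluate the rule periodically; fire the action at most once.
    while True:
        if pod_power_watts(pod, namespace) > POWER_BUDGET_WATTS:
            deschedule(pod, namespace)
            return
        time.sleep(period_s)
```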
As we explained earlier, when the orchestrator chooses an action to take, that action is in terms of scheduling pods and changing the power management configuration of the hardware. The actions that can be taken, such as setting C states and P states, must be defined in terms that the orchestrator can select and that the power management controller and power agent can carry out. This table describes at a high level what those actions look like; because all these technologies work differently, the actions cover a lot of ground. In many cases BIOS settings must be changed. For example, with Turbo Boost, depending on whether the scheduler wants direct control, the feature may need to be disabled in the BIOS. Speed Select Technology is a more interesting case because it has many sub-features, and to work properly the scheduler needs to know which features are available. SmartEdge Open has a component, Node Feature Discovery or NFD, that inventories an edge platform to determine what hardware is available, together with the hardware's capabilities. An additional component called Telemetry Aware Scheduling, or TAS, works together with the Kubernetes CPU Manager to select landing zones for pods based on that inventory of capabilities.

Well, here we are at the end of the talk. The work we described is work in progress. Our in-house experiments have demonstrated the feasibility of application-aware telemetry and orchestration to implement the fundamental procedures of green edge computing, namely orchestration of applications and power management of the underlying hardware. Our next steps, as SmartEdge Open is deployed, are to package this work as SmartEdge components and get it into the hands of users. Telemetry collection and monitoring visualization for the human-in-the-loop mode of operation will come first, followed by policy-based and model-based orchestrators. Back in the section on use cases, we mentioned the need for building energy-aware applications. This is rather forward-looking work, because most cloud and edge applications are not designed to consume telemetry. We see a need to work with app developers and standards organizations to develop frameworks for telemetry, for a variety of reasons beyond energy awareness. We would look forward to conversations with anybody in the audience who is interested in collaborating on this.

Well, from me and my colleague Kanan, thank you very much for attending our talk. We hope you have a fun conference. Goodbye.